PREFACE


The scope of the theory of spin glasses has been expanding well beyond its origi- 
nal goal of explaining the experimental facts of spin glass materials. For the first 
time in the history of physics we have encountered an explicit example in which 
the phase space of the system has an extremely complex structure and yet is 
amenable to rigorous, systematic analyses. Investigations of such systems have 
opened a new paradigm in statistical physics. Also, the framework of the analyti- 
cal treatment of these systems has gradually been recognized as an indispensable 
tool for the study of information processing tasks. 

One of the principal purposes of this book is to elucidate some of the im- 
portant recent developments in these interdisciplinary directions, such as error- 
correcting codes, image restoration, neural networks, and optimization problems. 
In particular, I would like to provide a unified viewpoint traversing several dif- 
ferent research fields with the replica method as the common language, which 
emerged from the spin glass theory. One may also notice the close relationship 
between the arguments using gauge symmetry in spin glasses and the Bayesian 
method in information processing problems. Accordingly, this book is not neces- 
sarily written as a comprehensive introduction to single topics in the conventional 
classification of subjects like spin glasses or neural networks. 

In a certain sense, statistical mechanics and information sciences may have 
been destined to be directed towards common objectives since Shannon formu- 
lated information theory about fifty years ago with the concept of entropy as the 
basic building block. It would, however, have been difficult to envisage how this 
actually would happen: that the physics of disordered systems, and spin glass 
theory in particular, at its maturity naturally encompasses some of the impor- 
tant aspects of information sciences, thus reuniting the two disciplines. It would 
then reasonably be expected that in the future this cross-disciplinary field will 
continue to develop rapidly far beyond the current perspective. This is the very 
purpose for which this book is intended to establish a basis. 

The book is composed of two parts. The first part concerns the theory of 
spin glasses. Chapter 1 is an introduction to the general mean-field theory of 
phase transitions. Basic knowledge of statistical mechanics at undergraduate 
level is assumed. The standard mean-field theory of spin glasses is developed 
in Chapters 2 and 3, and Chapter 4 is devoted to symmetry arguments using 
gauge transformations. These four chapters do not cover everything to do with 
spin glasses. For example, hotly debated problems like the three-dimensional spin 
glass and anomalously slow dynamics are not included here. The reader will find 
relevant references listed at the end of each chapter to cover these and other 
topics not treated here.

The second part deals with statistical-mechanical approaches to information
processing problems. Chapter 5 is devoted to error-correcting codes and Chapter 
6 to image restoration. Neural networks are discussed in Chapters 7 and 8, and 
optimization problems are elucidated in Chapter 9. Most of these topics are 
formulated as applications of the statistical mechanics of spin glasses, with a 
few exceptions. For each topic in this second part, there is of course a long 
history, and consequently a huge amount of knowledge has been accumulated. 
The presentation in the second part reflects recent developments in statistical- 
mechanical approaches and does not necessarily cover all the available materials. 
Again, the references at the end of each chapter will be helpful in filling the gaps. 
The policy for listing the references is, first, to refer explicitly to the original 
papers for topics discussed in detail in the text, and second, whenever possible, 
to refer to review articles and books at the end of a chapter in order to avoid an 
excessively long list of references. I therefore have to apologize to those authors 
whose papers have only been referred to indirectly via these reviews and books. 

The reader interested mainly in the second part may skip Chapters 3 and 4 
in the first part before proceeding to the second part. Nevertheless it is recom- 
mended to browse through the introductory sections of these chapters, including 
replica symmetry breaking (§§3.1 and 3.2) and the main part of gauge theory 
(§§4.1 to 4.3 and 4.6), for a deeper understanding of the techniques relevant to 
the second part. It is in particular important for the reader who is interested in 
Chapters 5 and 6 to go through these sections. 

The present volume is the English edition of a book written in Japanese 
by me and published in 1999. I have revised a significant part of the Japanese 
edition and added new material in this English edition. The Japanese edition 
emerged from lectures at Tokyo Institute of Technology and several other uni- 
versities. I would like to thank those students who made useful comments on 
the lecture notes. I am also indebted to colleagues and friends for collabora- 
tions, discussions, and comments on the manuscript: in particular, to Jun-ichi 
Inoue, Yoshiyuki Kabashima, Kazuyuki Tanaka, Tomohiro Sasamoto, Toshiyuki 
Tanaka, Shigeru Shinomoto, Taro Toyoizumi, Michael Wong, David Saad, Peter 
Sollich, Ton Coolen, and John Cardy. I am much obliged to David Sherrington 
for useful comments, collaborations, and a suggestion to publish the present En- 
glish edition. If this book is useful to the reader, a good part of the credit should 
be attributed to these outstanding people. 


Tokyo 
February 2001


1
MEAN-FIELD THEORY OF PHASE TRANSITIONS


Methods of statistical mechanics have been enormously successful in clarifying 
the macroscopic properties of many-body systems. Typical examples are found 
in magnetic systems, which have been a test bed for a variety of techniques. 
In the present chapter, we introduce the Ising model of magnetic systems and 
explain its mean-field treatment, a very useful technique of analysis of many- 
body systems by statistical mechanics. Mean-field theory explained here forms 
the basis of the methods used repeatedly throughout this book. The arguments in 
the present chapter represent a general mean-field theory of phase transitions in 
the Ising model with uniform ferromagnetic interactions. Special features of spin 
glasses and related disordered systems will be taken into account in subsequent 
chapters. 


1.1 Ising model 


A principal goal of statistical mechanics is the clarification of the macroscopic 
properties of many-body systems starting from the knowledge of interactions 
between microscopic elements. For example, water can exist as vapour (gas), 
water (liquid), or ice (solid), any one of which looks very different from the
others, although the microscopic elements are always the same molecules of H2O.
Macroscopic properties of these three phases differ widely from each other be- 
cause intermolecular interactions significantly change the macroscopic behaviour 
according to the temperature, pressure, and other external conditions. To inves- 
tigate the general mechanism of such sharp changes of macroscopic states of 
materials, we introduce the Ising model, one of the simplest models of interact- 
ing many-body systems. The following arguments are not intended to explain 
directly the phase transition of water but constitute the standard theory to de- 
scribe the common features of phase transitions. 

Let us call the set of integers from 1 to N, $V = \{1, 2, \ldots, N\} \equiv \{i\}_{i=1,\ldots,N}$,
a lattice, and its element i a site. A site here refers to a generic abstract object. 
For example, a site may be the real lattice point on a crystal, or the pixel of 
a digital picture, or perhaps the neuron in a neural network. These and other 
examples will be treated in subsequent chapters. In the first part of this book 
we will mainly use the words of models of magnetism with sites on a lattice for 
simplicity. We assign a variable $S_i$ to each site. The Ising spin is characterized
by the binary value $S_i = \pm 1$, and mostly this case will be considered throughout
this volume. In the problem of magnetism, the Ising spin $S_i$ represents whether
the microscopic magnetic moment is pointing up or down.

Fig. 1.1. Square lattice and nearest neighbour sites $\langle ij \rangle$ on it


A bond is a pair of sites $(ij)$. An appropriate set of bonds will be denoted as
$B = \{(ij)\}$. We assign an interaction energy (or an interaction, simply) $-J S_i S_j$
to each bond in the set B. The interaction energy is $-J$ when the states of the
two spins are the same ($S_i = S_j$) and is $J$ otherwise ($S_i = -S_j$). Thus the
former has a lower energy and is more stable than the latter if J > 0. For the
magnetism problem, $S_i = 1$ represents the up state of a spin (↑) and $S_i = -1$
the down state (↓), and the two interacting spins tend to be oriented in the
same direction (↑↑ or ↓↓) when J > 0. The positive interaction can then lead
to macroscopic magnetism (ferromagnetism) because all pairs of spins in the set 
B have the tendency to point in the same direction. The positive interaction 
J > 0 is therefore called a ferromagnetic interaction. By contrast the negative 
interaction J < 0 favours antiparallel states of interacting spins and is called an 
antiferromagnetic interaction. 

In some cases a site has its own energy of the form $-h S_i$, the Zeeman energy
in magnetism. The total energy of a system therefore has the form

$$H = -J \sum_{(ij) \in B} S_i S_j - h \sum_{i=1}^{N} S_i. \tag{1.1}$$


Equation (1.1) is the Hamiltonian (or the energy function) of the Ising model. 

The choice of the set of bonds B depends on the type of problem one is 
interested in. For example, in the case of a two-dimensional crystal lattice, the 
set of sites V = {i} is a set of points with regular intervals on a two-dimensional 
space. The bond $(ij)$ ($\in B$) is a pair of nearest neighbour sites (see Fig. 1.1).
We use the notation $\langle ij \rangle$ for the pair of sites $(ij) \in B$ in the first sum on the
right hand side of (1.1) if it runs over nearest neighbour bonds as in Fig. 1.1. By
contrast, in the infinite-range model to be introduced shortly, the set of bonds 
B is composed of all possible pairs of sites in the set of sites V. 
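As a concrete illustration of (1.1), the following Python sketch builds the bond set B for the two cases just mentioned and evaluates the energy of a given spin configuration. The function names, the site labelling and the periodic boundary condition are assumptions made for this example only.

```python
import itertools


def square_lattice_bonds(L):
    """Nearest-neighbour bonds of an L x L square lattice (periodic boundaries assumed)."""
    bonds = set()
    for x, y in itertools.product(range(L), repeat=2):
        i = x * L + y                      # site label, 0 .. L*L - 1
        neighbour_x = ((x + 1) % L) * L + y
        neighbour_y = x * L + (y + 1) % L
        for j in (neighbour_x, neighbour_y):
            if i != j:
                bonds.add((min(i, j), max(i, j)))  # store each bond once
    return sorted(bonds)


def complete_graph_bonds(N):
    """All pairs of sites, as in the infinite-range model."""
    return [(i, j) for i in range(N) for j in range(i + 1, N)]


def ising_energy(spins, bonds, J=1.0, h=0.0):
    """Hamiltonian (1.1): -J * sum over bonds of S_i S_j - h * sum over sites of S_i."""
    interaction = -J * sum(spins[i] * spins[j] for i, j in bonds)
    field = -h * sum(spins)
    return interaction + field


if __name__ == "__main__":
    L = 4
    bonds = square_lattice_bonds(L)
    all_up = [1] * (L * L)
    print(len(bonds), ising_energy(all_up, bonds, J=1.0, h=0.0))
```

With L = 4 every site has z = 4 neighbours, so the number of bonds is zN/2 = 32, and the all-up configuration has the lowest interaction energy for J > 0.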

The general prescription of statistical mechanics is to calculate the thermal 
average of a physical quantity using the probability distribution 

$$P(S) = \frac{e^{-\beta H}}{Z} \tag{1.2}$$

for a given Hamiltonian H. Here, $S \equiv \{S_i\}$ represents the set of spin states, the
spin configuration. We take the unit of temperature such that Boltzmann's constant
$k_B$ is unity, and $\beta$ is the inverse temperature $\beta = 1/T$. The normalization
factor Z is the partition function


$$Z = \sum_{S_1 = \pm 1} \sum_{S_2 = \pm 1} \cdots \sum_{S_N = \pm 1} e^{-\beta H}. \tag{1.3}$$


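One sometimes uses the notation Tr for the sum over all possible spin
configurations appearing in (1.3). Hereafter we use this notation for the sum over
the values of Ising spins on sites:

$$Z = \mathrm{Tr}\, e^{-\beta H}. \tag{1.4}$$

Equation (1.2) is called the Gibbs-Boltzmann distribution. We write the
expectation value with respect to this distribution using angular brackets
$\langle \cdots \rangle$.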

Spin variables are not necessarily restricted to the Ising type ($S_i = \pm 1$). For
instance, in the XY model, the variable at a site i has a real value $\theta_i$ modulo
$2\pi$, and the interaction energy has the form $-J \cos(\theta_i - \theta_j)$. The energy due to
an external field is $-h \cos\theta_i$. The Hamiltonian of the XY model is thus written
as

$$H = -J \sum_{(ij) \in B} \cos(\theta_i - \theta_j) - h \sum_i \cos\theta_i. \tag{1.5}$$

The XY spin variable $\theta_i$ can be identified with a point on the unit circle. If
J > 0, the interaction term in (1.5) is ferromagnetic as it favours a parallel spin
configuration ($\theta_i = \theta_j$).


1.2 Order parameter and phase transition 


One of the most important quantities used to characterize the macroscopic prop- 
erties of the Ising model with ferromagnetic interactions is the magnetization. 
Magnetization is defined by 


$$m = \frac{1}{N} \left\langle \sum_{i=1}^{N} S_i \right\rangle = \frac{1}{N} \mathrm{Tr} \left( \Big( \sum_i S_i \Big) P(S) \right) \tag{1.6}$$

and measures the overall ordering in a macroscopic system (i.e. the system in
the thermodynamic limit $N \to \infty$). Magnetization is a typical example of an
order parameter, which is a measure of whether or not a macroscopic system is
in an ordered state in an appropriate sense. The magnetization vanishes if there
exist equal numbers of up spins $S_i = 1$ and down spins $S_i = -1$, suggesting the
absence of a uniformly ordered state.
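For a very small system the trace in (1.3) and the thermal average in (1.6) can be carried out by brute force. The sketch below does this in Python; the function name and the choice of a ring of sites as the bond set in the usage lines are assumptions of this example. Because the zero-field weights are symmetric under a flip of all spins, the average of m is exactly zero at finite N, so the average of |m| is returned as well.

```python
import itertools
import math


def exact_thermal_averages(N, bonds, J, h, T):
    """Return (Z, <m>, <|m|>) by summing over all 2**N Ising configurations."""
    beta = 1.0 / T
    Z = 0.0
    m_weighted = 0.0
    abs_m_weighted = 0.0
    for spins in itertools.product((1, -1), repeat=N):
        energy = -J * sum(spins[i] * spins[j] for i, j in bonds) - h * sum(spins)
        weight = math.exp(-beta * energy)  # Boltzmann factor of (1.2)
        m = sum(spins) / N
        Z += weight
        m_weighted += m * weight
        abs_m_weighted += abs(m) * weight
    return Z, m_weighted / Z, abs_m_weighted / Z


if __name__ == "__main__":
    N = 8
    ring = [(i, (i + 1) % N) for i in range(N)]  # hypothetical example lattice
    for T in (0.5, 1.0, 4.0):
        Z, m, abs_m = exact_thermal_averages(N, ring, J=1.0, h=0.0, T=T)
        print(T, Z, m, abs_m)
```

The cost grows as 2^N, which is why this direct approach is limited to a handful of spins.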

At low temperatures $\beta \gg 1$, the Gibbs-Boltzmann distribution (1.2) implies
that low-energy states are realized with much higher probability than high-energy
states. The low-energy states of the ferromagnetic Ising model (1.1) without the
external field h = 0 have almost all spins in the same direction. Thus at low
temperatures the spin states are either up $S_i = 1$ at almost all sites or down
$S_i = -1$ at almost all sites. The magnetization m is then very close to either 1
or $-1$, respectively.

Fig. 1.2. Temperature dependence of magnetization

As the temperature increases, $\beta$ decreases, and then the states with various
energies emerge with similar probabilities. Under such circumstances, $S_i$ would
change frequently from 1 to $-1$ and vice versa, so that the macroscopic state of
the system is disordered with the magnetization vanishing. The magnetization
m as a function of the temperature T therefore has the behaviour depicted in
Fig. 1.2. There is a critical temperature $T_c$: $m \neq 0$ for $T < T_c$ and $m = 0$ for
$T > T_c$.

This type of phenomenon in a macroscopic system is called a phase transition 
and is characterized by a sharp and singular change of the value of the order 
parameter between vanishing and non-vanishing values. In magnetic systems the 
state for $T < T_c$ with $m \neq 0$ is called the ferromagnetic phase and the state
at $T > T_c$ with $m = 0$ is called the paramagnetic phase. The temperature $T_c$ is
termed a critical point or a transition point.


1.3 Mean-field theory 


In principle, it is possible to calculate the expectation value of any physical quan- 
tity using the Gibbs-Boltzmann distribution (1.2). It is, however, usually very 
difficult in practice to carry out the sum over $2^N$ terms appearing in the partition
function (1.3). One is thus often forced to resort to approximations. Mean-field 
theory (or the mean-field approximation) is used widely in such situations. 


1.3.1 Mean-field Hamiltonian 


The essence of mean-field theory is to neglect fluctuations of microscopic
variables around their mean values. One splits the spin variable $S_i$ into the mean
$m = \sum_i \langle S_i \rangle / N = \langle S_i \rangle$ and the deviation (fluctuation) $\delta S_i = S_i - m$ and assumes
that the second-order term with respect to the fluctuation $\delta S_i$ is negligibly small
in the interaction energy:

$$\begin{aligned}
H &= -J \sum_{(ij) \in B} (m + \delta S_i)(m + \delta S_j) - h \sum_i S_i \\
&\approx -J m^2 N_B - J m \sum_{(ij) \in B} (\delta S_i + \delta S_j) - h \sum_i S_i.
\end{aligned} \tag{1.7}$$


To simplify this expression, we note that each bond $(ij)$ appears only once in the
sum of $\delta S_i + \delta S_j$ in the second line. Thus $\delta S_i$ and $\delta S_j$ assigned at both ends of
a bond are summed up z times, where z is the number of bonds emanating from
a given site (the coordination number), in the second sum in the final expression
of (1.7):

$$\begin{aligned}
H &= -J m^2 N_B - J m z \sum_i \delta S_i - h \sum_i S_i \\
&= N_B J m^2 - (J m z + h) \sum_i S_i.
\end{aligned} \tag{1.8}$$


A few comments on (1.8) are in order. 

1. $N_B$ is the number of elements in the set of bonds B, $N_B = |B|$.

2. We have assumed that the coordination number z is independent of site
i, so that $N_B$ is related to z by $zN/2 = N_B$. One might imagine that the
total number of bonds is zN since each site has z bonds emanating from
it. However, a bond is counted twice at both its ends and one should divide
zN by two to count the total number of bonds correctly.

3. The expectation value $\langle S_i \rangle$ has been assumed to be independent of i. This
value should be equal to m according to (1.6). In the conventional
ferromagnetic Ising model, the interaction J is a constant and thus the average
order of spins is uniform in space. In spin glasses and other cases to be
discussed later this assumption does not hold.

The effects of interactions have now been hidden in the magnetization m in 
the mean-field Hamiltonian (1.8). The problem apparently looks like a non- 
interacting case, which significantly reduces the difficulties in analytical manip- 
ulations. 


1.3.2 Equation of state 


The mean-field Hamiltonian (1.8) facilitates calculations of various quantities. 
For example, the partition function is given as 


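$$\begin{aligned}
Z &= \mathrm{Tr} \exp\left[ \beta \Big\{ -N_B J m^2 + (Jmz + h) \sum_i S_i \Big\} \right] \\
&= e^{-\beta N_B J m^2} \{ 2 \cosh \beta (Jmz + h) \}^N.
\end{aligned} \tag{1.9}$$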


Fig. 1.3. Solution of the mean-field equation of state (curves $y = m$ and $y = \tanh(\beta J m z)$)

A similar procedure with $S_i$ inserted after the trace operation Tr in (1.9) yields
the magnetization m,

$$m = \frac{\mathrm{Tr}\, S_i e^{-\beta H}}{Z} = \tanh \beta (Jmz + h). \tag{1.10}$$
This equation (1.10) determines the order parameter m and is called the equation
of state. The magnetization in the absence of the external field h = 0, the
spontaneous magnetization, is obtained as the solution of (1.10) graphically: as one
can see in Fig. 1.3, the existence of a solution with non-vanishing magnetization
$m \neq 0$ is determined by whether the slope of the curve $\tanh(\beta J m z)$ at m = 0 is
larger or smaller than unity. The first term of the expansion of the right hand
side of (1.10) with h = 0 is $\beta J z m$, so that there exists a solution with $m \neq 0$ if
and only if $\beta J z > 1$. From $\beta J z = Jz/T = 1$, the critical temperature is found
to be $T_c = Jz$. Figure 1.3 clearly shows that the positive and negative solutions
for m have the same absolute value ($\pm m$), corresponding to the change of sign
of all spins ($S_i \to -S_i$, $\forall i$). Hereafter we often restrict ourselves to the case of
m > 0 without loss of generality.
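The graphical solution of the equation of state can also be obtained numerically. The following Python sketch iterates m -> tanh(beta (J z m + h)) from a positive starting value until it stops changing; the function name, the default parameters and the choice of plain fixed-point iteration are assumptions of this example rather than a method prescribed by the text.

```python
import math


def mean_field_magnetization(T, J=1.0, z=4, h=0.0, m0=1.0, tol=1e-12, max_iter=100000):
    """Solve m = tanh(beta * (J*z*m + h)), eq. (1.10), by fixed-point iteration."""
    beta = 1.0 / T
    m = m0
    for _ in range(max_iter):
        m_new = math.tanh(beta * (J * z * m + h))
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m  # not fully converged; convergence is slow very close to T_c


if __name__ == "__main__":
    # With J = 1 and z = 4 the mean-field critical temperature is T_c = J*z = 4.
    for T in (1.0, 2.0, 3.0, 3.9, 4.5, 6.0):
        print(T, mean_field_magnetization(T))
```

Below T_c = Jz the iteration settles on the positive spontaneous magnetization, and above T_c it decays to zero. Starting from a negative m0 gives the solution of opposite sign.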


1.3.3 Free energy and the Landau theory 


It is possible to calculate the specific heat C, magnetic susceptibility $\chi$, and other
quantities by mean-field theory. We develop an argument starting from the free 
energy. The general theory of statistical mechanics tells us that the free energy 
is proportional to the logarithm of the partition function. Using (1.9), we have 
the mean-field free energy of the Ising model as 


$$F = -T \log Z = -NT \log\{ 2 \cosh \beta (Jmz + h) \} + N_B J m^2. \tag{1.11}$$

When there is no external field h = 0 and the temperature T is close to the
critical point $T_c$, the magnetization m is expected to be close to zero. It is then
possible to expand the right hand side of (1.11) in powers of m. The expansion
to fourth order is

$$F = -NT \log 2 + \frac{JzN}{2} (1 - \beta J z) m^2 + \frac{N}{12} (Jzm)^4 \beta^3. \tag{1.12}$$

Fig. 1.4. Free energy as a function of the order parameter


It should be noted that the coefficient of $m^2$ changes sign at $T_c$. As one can see
in Fig. 1.4, the minima of the free energy are located at $m \neq 0$ when $T < T_c$ and
at m = 0 if $T > T_c$. The statistical-mechanical average of a physical quantity
corresponds to its value at the state that minimizes the free energy (thermal
equilibrium state). Thus the magnetization in thermal equilibrium is zero when
$T > T_c$ and is non-vanishing for $T < T_c$. This conclusion is in agreement with
the previous argument using
the equation of state. The present theory starting from the Taylor expansion 
of the free energy by the order parameter is called the Landau theory of phase 
transitions. 
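The step from (1.11) to (1.12) uses only the small-argument expansion of log cosh together with the relation between the number of bonds and the coordination number, zN/2. A sketch of the intermediate algebra at h = 0 follows, with $x = \beta J z m$ introduced here purely as shorthand:

```latex
\begin{align*}
\log(2\cosh x) &= \log 2 + \frac{x^2}{2} - \frac{x^4}{12} + O(x^6),
  \qquad x = \beta J z m, \\
F &= -NT\log 2 - \frac{NT}{2}(\beta J z m)^2 + \frac{NT}{12}(\beta J z m)^4
     + \frac{zN}{2} J m^2 + O(m^6) \\
  &= -NT\log 2 + \frac{JzN}{2}\,(1 - \beta J z)\, m^2
     + \frac{N}{12}\,(Jzm)^4 \beta^3 + O(m^6).
\end{align*}
```

Setting the derivative of the truncated expression with respect to m to zero gives, for T below the critical temperature Jz, $m^2 = 3(\beta J z - 1)/(\beta J z)^3$, so within this expansion the magnetization vanishes as the square root of the distance from the critical temperature.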


1.4 Infinite-range model 


Mean-field theory is an approximation. However, it gives the exact solution in the 
case of the infinite-range model where all possible pairs of sites have interactions. 
The Hamiltonian of the infinite-range model is 


$$H = -\frac{J}{2N} \sum_{i \neq j} S_i S_j - h \sum_i S_i. \tag{1.13}$$

The first sum on the right hand side runs over all pairs of different sites $(i, j)$
($i = 1, \ldots, N$; $j = 1, \ldots, N$; $i \neq j$). The factor 2 in the denominator exists so that each
pair $(i, j)$ appears only once in the sum, for example $(S_1 S_2 + S_2 S_1)/2 = S_1 S_2$.
The factor N in the denominator is to make the Hamiltonian (energy) extensive
(i.e. O(N)) since the number of terms in the sum is $N(N - 1)/2$.

The partition function of the infinite-range model can be evaluated as follows. 
By definition, 


$$Z = \mathrm{Tr} \exp\left( \frac{\beta J}{2N} \Big( \sum_i S_i \Big)^2 - \frac{\beta J}{2} + \beta h \sum_i S_i \right). \tag{1.14}$$

Here the constant term $-\beta J/2$ compensates for the contribution $\sum_i (S_i^2)$. This
term, of $O(N^0 = 1)$, is sufficiently small compared to the other terms, of O(N),
in the thermodynamic limit $N \to \infty$ and will be neglected hereafter. Since we
cannot carry out the trace operation with the term $(\sum_i S_i)^2$ in the exponent, we
decompose this term by the Gaussian integral

$$e^{a x^2 / 2} = \sqrt{\frac{aN}{2\pi}} \int_{-\infty}^{\infty} dm\, e^{-N a m^2 / 2 + \sqrt{N} a m x}. \tag{1.15}$$
T r 


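Substituting $a = \beta J$ and $x = \sum_i S_i / \sqrt{N}$ and using (1.9), we find

$$\begin{aligned}
Z &= \mathrm{Tr} \sqrt{\frac{\beta J N}{2\pi}} \int_{-\infty}^{\infty} dm\, \exp\left( -\frac{N \beta J m^2}{2} + \beta J m \sum_i S_i + \beta h \sum_i S_i \right) && (1.16) \\
&= \sqrt{\frac{\beta J N}{2\pi}} \int_{-\infty}^{\infty} dm\, \exp\left( -\frac{N \beta J m^2}{2} + N \log\{ 2 \cosh \beta (Jm + h) \} \right). && (1.17)
\end{aligned}$$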


The problem has thus been reduced to a simple single integral. 

We can evaluate the above integral by steepest descent in the thermodynamic
limit $N \to \infty$: the integral (1.17) approaches asymptotically the largest value of
its integrand in the thermodynamic limit. The value of the integration variable
m that gives the maximum of the integrand is determined by the saddle-point
condition, that is maximization of the exponent:

$$\frac{\partial}{\partial m} \left( -\frac{\beta J}{2} m^2 + \log\{ 2 \cosh \beta (Jm + h) \} \right) = 0 \tag{1.18}$$

or

$$m = \tanh \beta (Jm + h). \tag{1.19}$$


Equation (1.19) agrees with the mean-field equation (1.10) after replacement of 
J with J/N and z with N. Thus mean-field theory leads to the exact solution 
for the infinite-range model. 
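The claim that the saddle point becomes exact for large N can be checked numerically, because the energy of the infinite-range model depends on the spins only through their sum. The Python sketch below sums the partition function (1.14) exactly over the number of down spins, using binomial multiplicities, and compares the free energy per spin with the saddle-point value read off from the exponent of (1.17). The function names and the fixed-point solver for (1.19) are assumptions of this example.

```python
import math


def log_partition_infinite_range(N, J, h, T):
    """log Z of (1.14), summed exactly over k = number of down spins."""
    beta = 1.0 / T
    log_terms = []
    for k in range(N + 1):
        M = N - 2 * k  # total spin, sum_i S_i
        log_multiplicity = (math.lgamma(N + 1) - math.lgamma(k + 1)
                            - math.lgamma(N - k + 1))  # log of binomial(N, k)
        exponent = beta * J * M * M / (2 * N) - beta * J / 2 + beta * h * M
        log_terms.append(log_multiplicity + exponent)
    top = max(log_terms)  # log-sum-exp to avoid overflow
    return top + math.log(sum(math.exp(t - top) for t in log_terms))


def saddle_point_free_energy(J, h, T, tol=1e-13, max_iter=100000):
    """Solve (1.19) by iteration and return (m, free energy per spin)."""
    beta = 1.0 / T
    m = 1.0
    for _ in range(max_iter):
        m_new = math.tanh(beta * (J * m + h))
        if abs(m_new - m) < tol:
            m = m_new
            break
        m = m_new
    f = J * m * m / 2 - T * math.log(2 * math.cosh(beta * (J * m + h)))
    return m, f


if __name__ == "__main__":
    J, h, T = 1.0, 0.1, 0.5
    m, f_saddle = saddle_point_free_energy(J, h, T)
    for N in (10, 100, 1000, 10000):
        f_exact = -T * log_partition_infinite_range(N, J, h, T) / N
        print(N, f_exact, f_saddle, f_exact - f_saddle)
```

The difference between the two free energies per spin shrinks as N grows, in line with the statement that corrections to the saddle-point value are subleading in N.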

The quantity m was introduced as an integration variable in the evaluation 
of the partition function of the infinite-range model. It nevertheless turned out 
to have a direct physical interpretation, the magnetization, according to the 
correspondence with mean-field theory through the equation of state (1.19). To 
understand the significance of this interpretation from a different point of view, 
we write the saddle-point condition for (1.16) as 


$$m = \frac{1}{N} \sum_i S_i. \tag{1.20}$$

The sum in (1.20) agrees with the average value m, the magnetization, in the
thermodynamic limit $N \to \infty$ if the law of large numbers applies. In other
words, fluctuations of magnetization vanish in the thermodynamic limit in the 
infinite-range model and thus mean-field theory gives the exact result. 

The infinite-range model may be regarded as a model with nearest neighbour
interactions in infinite-dimensional space. To see this, note that the coordination
number z of a site on the d-dimensional hypercubic lattice is proportional to d.
More precisely, z = 4 for the two-dimensional square lattice, z = 6 for the
three-dimensional cubic lattice, and z = 2d in general. Thus a site is connected to very
many other sites for large d so that the relative effects of fluctuations diminish
in the limit of large d, leading to the same behaviour as the infinite-range model.


1.5 Variational approach 

Another point of view is provided for mean-field theory by a variational approach. 
The source of difficulty in calculations of various physical quantities lies in the 
non-trivial structure of the probability distribution (1.2) with the Hamiltonian 
(1.1) where the degrees of freedom S are coupled with each other. It may thus
be useful to employ an approximation to decouple the distribution into simple
functions. We therefore introduce a single-site distribution function

$$P_i(\sigma_i) = \mathrm{Tr}\, P(S)\, \delta(S_i, \sigma_i) \tag{1.21}$$

and approximate the full distribution by the product of single-site functions:

$$P(S) \approx \prod_i P_i(S_i). \tag{1.22}$$


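We determine $P_i(S_i)$ by the general principle of statistical mechanics to minimize
the free energy $F = E - TS$, where the internal energy E is the expectation value
of the Hamiltonian and S is the entropy (not to be confused with spin). Under
the above approximation, one finds

$$\begin{aligned}
F &= \mathrm{Tr} \Big\{ H(S) \prod_i P_i(S_i) \Big\} + T\, \mathrm{Tr} \Big\{ \prod_i P_i(S_i) \sum_i \log P_i(S_i) \Big\} \\
&= -J \sum_{(ij) \in B} \mathrm{Tr}\, S_i S_j P_i(S_i) P_j(S_j) - h \sum_i \mathrm{Tr}\, S_i P_i(S_i) \\
&\quad + T \sum_i \mathrm{Tr}\, P_i(S_i) \log P_i(S_i),
\end{aligned} \tag{1.23}$$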

where we have used the normalization $\mathrm{Tr}\, P_i(S_i) = 1$. Variation of this free energy
by $P_i(S_i)$ under the condition of normalization gives

$$\frac{\delta F}{\delta P_i(S_i)} = -J \sum_{j \in I} S_i m_j - h S_i + T \log P_i(S_i) + T + \lambda = 0, \tag{1.24}$$

where $\lambda$ is the Lagrange multiplier for the normalization condition and we have
written $m_j$ for $\mathrm{Tr}\, S_j P_j(S_j)$. The set of sites connected to i has been denoted by
I. The minimization condition (1.24) yields the distribution function


$$P_i(S_i) = \frac{\exp\left( \beta J \sum_{j \in I} S_i m_j + \beta h S_i \right)}{Z_{\mathrm{MF}}}, \tag{1.25}$$

where $Z_{\mathrm{MF}}$ is the normalization factor. In the case of uniform magnetization
$m_j$ (= m), this result (1.25) together with the decoupling (1.22) leads to the
distribution $P(S) \propto e^{-\beta H}$ with H identical to the mean-field Hamiltonian (1.8)
up to a trivial additive constant.

The argument so far has been general in that it did not use the values of the 
Ising spins $S_i = \pm 1$ and thus applies to any other cases. It is instructive to use
the values of the Ising spins explicitly and see its consequence. Since $S_i$ takes
only two values $\pm 1$, the following is the general form of the distribution function:

$$P_i(S_i) = \frac{1 + m_i S_i}{2}, \tag{1.26}$$


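which is compatible with the previous notation $m_i = \mathrm{Tr}\, S_i P_i(S_i)$. Substitution
of (1.26) into (1.23) yields

$$\begin{aligned}
F &= -J \sum_{(ij) \in B} m_i m_j - h \sum_i m_i \\
&\quad + T \sum_i \left( \frac{1 + m_i}{2} \log \frac{1 + m_i}{2} + \frac{1 - m_i}{2} \log \frac{1 - m_i}{2} \right).
\end{aligned} \tag{1.27}$$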


Variation of this expression with respect to $m_i$ leads to

$$m_i = \tanh \beta \Big( J \sum_{j \in I} m_j + h \Big), \tag{1.28}$$

which is identical to (1.10) for uniform magnetization ($m_i = m$, $\forall i$). We have
again rederived the previous result of mean-field theory.
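The site-dependent form of the mean-field equation can be iterated directly on any lattice described by its neighbour lists. The Python sketch below does so; the function name, the simultaneous update of all sites and the two small example graphs are assumptions of this illustration.

```python
import math


def solve_local_mean_field(neighbours, J, h, T, m_init=0.5, tol=1e-12, max_iter=10000):
    """Iterate m_i = tanh(beta * (J * sum_{j in I} m_j + h)) for every site i.

    neighbours[i] is the list of sites connected to site i (the set I in the text).
    """
    beta = 1.0 / T
    m = [m_init] * len(neighbours)
    for _ in range(max_iter):
        m_new = [math.tanh(beta * (J * sum(m[j] for j in nbrs) + h))
                 for nbrs in neighbours]
        if max(abs(a - b) for a, b in zip(m_new, m)) < tol:
            return m_new
        m = m_new
    return m


if __name__ == "__main__":
    # Ring of 6 sites: every site has z = 2 neighbours, so all m_i coincide.
    ring = [[(i - 1) % 6, (i + 1) % 6] for i in range(6)]
    print(solve_local_mean_field(ring, J=1.0, h=0.0, T=1.0))
    # Open chain of 3 sites: the end sites have one neighbour only.
    chain = [[1], [0, 2], [1]]
    print(solve_local_mean_field(chain, J=1.0, h=0.0, T=1.0))
```

On the ring the result reproduces the uniform equation of state with z = 2, while on the open chain the middle site, which has more neighbours, acquires a larger local magnetization than the ends.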


Bibliographical note 


A compact exposition of the theory of phase transitions including mean-field 
theory is found in Yeomans (1992). For a full account of the theory of phase tran- 
sitions and critical phenomena, see Stanley (1987). In Opper and Saad (2001), 
one finds an extensive coverage of recent developments in applications of mean- 
field theory to interdisciplinary fields as well as a detailed elucidation of various 
aspects of mean-field theory. 


2 
MEAN-FIELD THEORY OF SPIN GLASSES 


We next discuss the problem of spin glasses. If the interactions between spins 
are not uniform in space, the analysis of the previous chapter does not apply 
in the naive form. In particular, when the interactions are ferromagnetic for 
some bonds and antiferromagnetic for others, then the spin orientation cannot 
be uniform in space, unlike the ferromagnetic system, even at low temperatures. 
Under such a circumstance it sometimes happens that spins become randomly 
frozen: random in space but frozen in time. This is the intuitive picture of the
spin glass phase. In the present chapter we investigate the condition for the 
existence of the spin glass phase by extending the mean-field theory so that it 
is applicable to the problem of disordered systems with random interactions. In 
particular we elucidate the properties of the so-called replica-symmetric solution. 
The replica method introduced here serves as a very powerful tool of analysis 
throughout this book. 


2.1 Spin glass and the Edwards-Anderson model


Atoms are located on lattice points at regular intervals in a crystal. This is 
not the case in glasses where the positions of atoms are random in space. An 
important point is that in glasses the apparently random locations of atoms do 
not change in a day or two into another set of random locations. A state with 
spatial randomness apparently does not change with time. The term spin glass 
implies that the spin orientation has a similarity to this type of location of atom 
in glasses: spins are randomly frozen in spin glasses. The goal of the theory of 
spin glasses is to clarify the conditions for the existence of spin glass states.¹

It is established within mean-field theory that the spin glass phase exists at 
low temperatures when random interactions of certain types exist between spins. 
The present and the next chapters are devoted to the mean-field theory of spin 
glasses. We first introduce a model of random systems and explain the replica 
method, a general method of analysis of random systems. Then the replica- 
symmetric solution is presented. 


¹ More rigorously, the spin glass state is considered stable for an infinitely long time at
least within the mean-field theory, whereas ordinary glasses will transform to crystals without
randomness after a very long period.


Let us suppose that the interaction $J_{ij}$ between a spin pair $(ij)$ changes from
one pair to another. The Hamiltonian in the absence of an external field is then
expressed as

$$H = -\sum_{(ij) \in B} J_{ij} S_i S_j. \tag{2.1}$$


The spin variables are assumed to be of the Ising type ($S_i = \pm 1$) here. Each $J_{ij}$ is
supposed to be distributed independently according to a probability distribution
$P(J_{ij})$. One often uses the Gaussian model and the $\pm J$ model as typical examples
of the distribution of $P(J_{ij})$. Their explicit forms are

$$P(J_{ij}) = \frac{1}{\sqrt{2\pi J^2}} \exp\left\{ -\frac{(J_{ij} - J_0)^2}{2J^2} \right\} \tag{2.2}$$

$$P(J_{ij}) = p\, \delta(J_{ij} - J) + (1 - p)\, \delta(J_{ij} + J), \tag{2.3}$$

respectively. Equation (2.2) is a Gaussian distribution with mean $J_0$ and variance
$J^2$ while in (2.3) $J_{ij}$ is either J (> 0) (with probability p) or $-J$ (with probability
$1 - p$).
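The two coupling distributions are straightforward to sample. The Python sketch below draws one set of quenched couplings for a given bond list and evaluates the energy (2.1); the function names, the dictionary representation of the couplings and the explicit random generator argument are assumptions of this example.

```python
import random


def sample_gaussian_couplings(bonds, J0, J, rng):
    """Draw J_ij independently from the Gaussian (2.2): mean J0, standard deviation J."""
    return {bond: rng.gauss(J0, J) for bond in bonds}


def sample_pm_J_couplings(bonds, J, p, rng):
    """Draw J_ij independently from (2.3): +J with probability p, -J otherwise."""
    return {bond: (J if rng.random() < p else -J) for bond in bonds}


def ea_energy(spins, couplings):
    """Hamiltonian (2.1) for a fixed (quenched) set of couplings."""
    return -sum(Jij * spins[i] * spins[j] for (i, j), Jij in couplings.items())


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sample is reproducible
    bonds = [(i, (i + 1) % 6) for i in range(6)]  # hypothetical 6-site ring
    couplings = sample_pm_J_couplings(bonds, J=1.0, p=0.5, rng=rng)
    print(couplings)
    print(ea_energy([1] * 6, couplings))
```

Once drawn, the couplings are held fixed while the spins fluctuate, which is the setting assumed in the rest of the chapter.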

Randomness in $J_{ij}$ has various types of origin depending upon the specific
problem. For example, in some spin glass materials, the positions of atoms
carrying spins are randomly distributed, resulting in randomness in interactions.
It is impossible in such a case to identify the location of each atom precisely
and therefore it is essential in theoretical treatments to introduce a probability
distribution for $J_{ij}$. In such a situation (2.1) is called the Edwards-Anderson
model (Edwards and Anderson 1975). The randomness in site positions (site
randomness) is considered less relevant to the macroscopic properties of spin
glasses compared to the randomness in interactions (bond randomness). Thus
$J_{ij}$ is supposed to be distributed randomly and independently at each bond $(ij)$
according to a probability like (2.2) and (2.3). The Hopfield model of neural
networks treated in Chapter 7 also has the form of (2.1). The type of randomness of
$J_{ij}$ in the Hopfield model is different from that of the Edwards-Anderson model.
The randomness in $J_{ij}$ of the Hopfield model has its origin in the randomness
of memorized patterns. We focus our attention on the spin glass problem in
Chapters 2 to 4.


2.1.2 Quenched system and configurational average 


Evaluation of a physical quantity using the Hamiltonian (2.1) starts from the
trace operation over the spin variables $S = \{S_i\}$ for a given fixed (quenched)
set of $J_{ij}$ generated by the probability distribution $P(J_{ij})$. For instance, the free
energy is calculated as

$$F = -T \log \mathrm{Tr}\, e^{-\beta H}, \tag{2.4}$$


which is a function of $J \equiv \{J_{ij}\}$. The next step is to average (2.4) over the
distribution of J to obtain the final expression of the free energy. The latter
procedure of averaging is called the configurational average and will be denoted
by brackets $[\cdots]$ in this book,

$$[F] = -T [\log Z] = -T \int \prod_{(ij)} dJ_{ij}\, P(J_{ij}) \log Z. \tag{2.5}$$


Differentiation of this averaged free energy $[F]$ by the external field h or the
temperature T leads to the magnetization or the internal energy, respectively. 
The reason to trace out first S for a given fixed J is that the positions of atoms 
carrying spins are random in space but fixed in the time scale of rapid thermal 
motions of spins. It is thus appropriate to evaluate the trace over S first with 
the interactions J fixed. 

It happens that the free energy per degree of freedom $f(J) = F(J)/N$ has
vanishingly small deviations from its mean value $[f]$ in the thermodynamic limit
$N \to \infty$. The free energy $f$ for a given $J$ thus agrees with the mean $[f]$ with
probability 1, which is called the self-averaging property of the free energy. We
may therefore choose either of these quantities in actual calculations. The mean
$[f]$ is easier to handle because it has no explicit dependence on $J$ even for
finite-size systems. We shall treat the average free energy in most of the cases
hereafter.


2.1.3 Replica method 


The dependence of $\log Z$ on $J$ is very complicated and it is not easy to calculate
the configurational average $[\log Z]$. The manipulations are greatly facilitated by
the relation

\[ [\log Z] = \lim_{n \to 0} \frac{[Z^n] - 1}{n}. \tag{2.6} \]

One prepares $n$ replicas of the original system, evaluates the configurational
average of the product of their partition functions $Z^n$, and then takes the limit
$n \to 0$. This technique, the replica method, is useful because it is easier to evaluate
$[Z^n]$ than $[\log Z]$.

Equation (2.6) is an identity and is always correct. A problem in actual replica
calculations is that one often evaluates $[Z^n]$ with positive integer $n$ in mind and
then extrapolates the result to $n \to 0$. We therefore should be careful to discuss
the significance of the results of replica calculations.
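
The limit in (2.6) can be checked numerically on a toy system. The following is a minimal sketch and not part of the original argument: it assumes two Ising spins coupled by a single Gaussian bond of unit variance at $\beta = 1$, so that $Z = 4\cosh J$, and it lets $n$ take small real values directly. The function names and the quadrature grid are illustrative choices.

```python
import math

def gauss_avg(func, zmax=8.0, steps=4000):
    """Trapezoid estimate of the Gaussian average of func(z), z ~ N(0, 1)."""
    h = 2.0 * zmax / steps
    total = 0.0
    for k in range(steps + 1):
        z = -zmax + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * math.exp(-0.5 * z * z) * func(z)
    return total * h / math.sqrt(2.0 * math.pi)

def partition(J):
    """Z of two Ising spins with bond J at beta = 1 (toy assumption)."""
    return 4.0 * math.cosh(J)

def avg_log_z():
    """Configurational average [log Z], computed directly."""
    return gauss_avg(lambda J: math.log(partition(J)))

def replica_estimate(n):
    """([Z^n] - 1) / n for small real n: the right hand side of (2.6) before the limit."""
    return (gauss_avg(lambda J: partition(J) ** n) - 1.0) / n

if __name__ == "__main__":
    print("direct [log Z]:", avg_log_z())
    for n in (0.1, 0.01, 0.001):
        print("n =", n, " ([Z^n] - 1)/n =", replica_estimate(n))
```

In this toy case $[Z^n]$ is available for real $n$, so the estimate simply approaches the direct average as $n$ decreases. The difficulty mentioned above arises when $[Z^n]$ is known only for integer $n$ and has to be continued to $n \to 0$.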


2.2 Sherrington—Kirkpatrick model 


The mean-field theory of spin glasses is usually developed for the
Sherrington–Kirkpatrick (SK) model, the infinite-range version of the
Edwards–Anderson model (Sherrington and Kirkpatrick 1975). In this section we
introduce the SK model and explain the basic methods of calculations using the
replica method.


2.2.1 SK model

The infinite-range model of spin glasses is expected to play the role of mean-field 
theory analogously to the case of the ferromagnetic Ising model. We therefore 
start from the Hamiltonian 


\[ H = -\sum_{i<j} J_{ij} S_i S_j - h \sum_i S_i. \tag{2.7} \]


The first sum on the right hand side runs over all distinct pairs of spins,
$N(N-1)/2$ of them. The interaction $J_{ij}$ is a quenched variable with the Gaussian
distribution function

\[ P(J_{ij}) = \frac{1}{J}\sqrt{\frac{N}{2\pi}} \exp\left\{ -\frac{N}{2J^2}\left( J_{ij} - \frac{J_0}{N} \right)^2 \right\}. \tag{2.8} \]

The mean and variance of this distribution are both proportional to $1/N$:

\[ [J_{ij}] = \frac{J_0}{N}, \qquad [(\Delta J_{ij})^2] = \frac{J^2}{N}. \tag{2.9} \]


The reason for such a normalization is that extensive quantities (e.g. the energy 
and specific heat) are found to be proportional to N if one takes the above 
normalization of interactions, as we shall see shortly. 


2.2.2 Replica average of the partition function 


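According to the prescription of the replica method, one first has to take the
configurational average of the $n$th power of the partition function

\[ [Z^n] = \int \left( \prod_{i<j} dJ_{ij}\, P(J_{ij}) \right) \mathrm{Tr} \exp\left( \beta \sum_{i<j} J_{ij} \sum_{\alpha=1}^{n} S_i^\alpha S_j^\alpha + \beta h \sum_{i=1}^{N} \sum_{\alpha=1}^{n} S_i^\alpha \right), \tag{2.10} \]

where $\alpha$ is the replica index. The integral over $J_{ij}$ can be carried out
independently for each $(ij)$ using (2.8). The result, up to a trivial constant, is

\[ \mathrm{Tr} \exp\left\{ \frac{1}{N} \sum_{i<j} \left( \frac{1}{2}\beta^2 J^2 \sum_{\alpha,\beta} S_i^\alpha S_j^\alpha S_i^\beta S_j^\beta + \beta J_0 \sum_\alpha S_i^\alpha S_j^\alpha \right) + \beta h \sum_i \sum_\alpha S_i^\alpha \right\}. \tag{2.11} \]

By rewriting the sums over $i < j$ and $\alpha, \beta$ in the above exponent, we find the
following form, for sufficiently large $N$:

\[ [Z^n] = \exp\left( \frac{N\beta^2 J^2 n}{4} \right) \mathrm{Tr} \exp\left\{ \frac{\beta^2 J^2}{2N} \sum_{\alpha<\beta} \left( \sum_i S_i^\alpha S_i^\beta \right)^2 + \frac{\beta J_0}{2N} \sum_\alpha \left( \sum_i S_i^\alpha \right)^2 + \beta h \sum_i \sum_\alpha S_i^\alpha \right\}. \tag{2.12} \]


2.2.3 Reduction by Gaussian integral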


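We could carry out the trace over $S_i^\alpha$ independently at each site $i$ in (2.12) if the
quantities in the exponent were linear in the spin variables. It is therefore useful
to linearize those squared quantities by Gaussian integrals with the integration
variables $q_{\alpha\beta}$ for the term $(\sum_i S_i^\alpha S_i^\beta)^2$ and $m_\alpha$ for $(\sum_i S_i^\alpha)^2$:

\[ \begin{aligned}
[Z^n] = {} & \exp\left( \frac{N\beta^2 J^2 n}{4} \right) \int \prod_{\alpha<\beta} dq_{\alpha\beta} \int \prod_\alpha dm_\alpha \\
& \cdot \exp\left( -\frac{N\beta^2 J^2}{2} \sum_{\alpha<\beta} q_{\alpha\beta}^2 - \frac{N\beta J_0}{2} \sum_\alpha m_\alpha^2 \right) \\
& \cdot \mathrm{Tr} \exp\left( \beta^2 J^2 \sum_{\alpha<\beta} q_{\alpha\beta} \sum_i S_i^\alpha S_i^\beta + \beta \sum_\alpha (J_0 m_\alpha + h) \sum_i S_i^\alpha \right).
\end{aligned} \tag{2.13} \]

If we represent the sum over the variable at a single site ($\sum_{S^\alpha}$) also by the
symbol Tr, the third line of the above equation is

\[ \left\{ \mathrm{Tr} \exp\left( \beta^2 J^2 \sum_{\alpha<\beta} q_{\alpha\beta} S^\alpha S^\beta + \beta \sum_\alpha (J_0 m_\alpha + h) S^\alpha \right) \right\}^N \equiv \exp(N \log \mathrm{Tr}\, e^L), \tag{2.14} \]

where

\[ L = \beta^2 J^2 \sum_{\alpha<\beta} q_{\alpha\beta} S^\alpha S^\beta + \beta \sum_\alpha (J_0 m_\alpha + h) S^\alpha. \tag{2.15} \]

We thus have

\[ \begin{aligned}
[Z^n] = {} & \exp\left( \frac{N\beta^2 J^2 n}{4} \right) \int \prod_{\alpha<\beta} dq_{\alpha\beta} \int \prod_\alpha dm_\alpha \\
& \cdot \exp\left( -\frac{N\beta^2 J^2}{2} \sum_{\alpha<\beta} q_{\alpha\beta}^2 - \frac{N\beta J_0}{2} \sum_\alpha m_\alpha^2 + N \log \mathrm{Tr}\, e^L \right).
\end{aligned} \tag{2.16} \]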


2.2.4 Steepest descent 


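The exponent of the above integrand is proportional to $N$, so that it is possible
to evaluate the integral by steepest descent. We then find in the thermodynamic
limit $N \to \infty$

\[ \begin{aligned}
[Z^n] & \approx \exp\left( -\frac{N\beta^2 J^2}{2} \sum_{\alpha<\beta} q_{\alpha\beta}^2 - \frac{N\beta J_0}{2} \sum_\alpha m_\alpha^2 + N \log \mathrm{Tr}\, e^L + \frac{N}{4}\beta^2 J^2 n \right) \\
& \approx 1 + Nn \left\{ -\frac{\beta^2 J^2}{4n} \sum_{\alpha\ne\beta} q_{\alpha\beta}^2 - \frac{\beta J_0}{2n} \sum_\alpha m_\alpha^2 + \frac{1}{n} \log \mathrm{Tr}\, e^L + \frac{1}{4}\beta^2 J^2 \right\}.
\end{aligned} \]

In deriving this last expression, the limit $n \to 0$ has been taken with $N$ kept
very large but finite. The values of $q_{\alpha\beta}$ and $m_\alpha$ in the above expression should
be chosen to extremize (maximize or minimize) the quantity in the braces $\{\ \}$.
The replica method therefore gives the free energy as

\[ -\beta[f] = \lim_{n\to 0} \frac{[Z^n] - 1}{nN} = \lim_{n\to 0} \left\{ -\frac{\beta^2 J^2}{4n} \sum_{\alpha\ne\beta} q_{\alpha\beta}^2 - \frac{\beta J_0}{2n} \sum_\alpha m_\alpha^2 + \frac{1}{4}\beta^2 J^2 + \frac{1}{n} \log \mathrm{Tr}\, e^L \right\}. \tag{2.17} \]

The saddle-point condition that the free energy is extremized with respect to the
variable $q_{\alpha\beta}$ ($\alpha \ne \beta$) is

\[ q_{\alpha\beta} = \frac{1}{\beta^2 J^2} \frac{\partial}{\partial q_{\alpha\beta}} \log \mathrm{Tr}\, e^L = \frac{\mathrm{Tr}\, S^\alpha S^\beta e^L}{\mathrm{Tr}\, e^L} = \langle S^\alpha S^\beta \rangle_L, \tag{2.18} \]

where $\langle\cdots\rangle_L$ is the average by the weight $e^L$. The saddle-point condition for $m_\alpha$
is, similarly,

\[ m_\alpha = \frac{1}{\beta J_0} \frac{\partial}{\partial m_\alpha} \log \mathrm{Tr}\, e^L = \frac{\mathrm{Tr}\, S^\alpha e^L}{\mathrm{Tr}\, e^L} = \langle S^\alpha \rangle_L. \tag{2.19} \]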


2.2.5 Order parameters 


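The variables $q_{\alpha\beta}$ and $m_\alpha$ have been introduced for technical convenience in
Gaussian integrals. However, these variables turn out to represent order
parameters in a similar manner to the ferromagnetic model explained in §1.4. To confirm
this fact, we first note that (2.18) can be written in the following form:

\[ q_{\alpha\beta} = [\langle S_i^\alpha S_i^\beta \rangle] = \left[ \frac{\mathrm{Tr}\, S_i^\alpha S_i^\beta \exp(-\beta \sum_\gamma H_\gamma)}{\mathrm{Tr} \exp(-\beta \sum_\gamma H_\gamma)} \right], \tag{2.20} \]

where $H_\gamma$ denotes the $\gamma$th replica Hamiltonian

\[ H_\gamma = -\sum_{i<j} J_{ij} S_i^\gamma S_j^\gamma - h \sum_i S_i^\gamma. \tag{2.21} \]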


It is possible to check, by almost the same calculations as in the previous sections,
that (2.18) and (2.20) represent the same quantity. First of all the denominator of
(2.20) is $Z^n$, which approaches one as $n \to 0$ so that it is unnecessary to evaluate
it explicitly. The numerator corresponds to the quantity in the calculation
of $[Z^n]$ with $S_i^\alpha S_i^\beta$ inserted after the Tr symbol. With these points in mind, one
can follow the calculations in §2.2.2 and afterwards to find the following quantity
instead of (2.14):

\[ (\mathrm{Tr}\, e^L)^{N-1} \cdot \mathrm{Tr}(S^\alpha S^\beta e^L). \tag{2.22} \]

The quantity $\log \mathrm{Tr}\, e^L$ is proportional to $n$ as is seen from (2.17) and thus $\mathrm{Tr}\, e^L$
approaches one as $n \to 0$. Hence (2.22) reduces to $\mathrm{Tr}(S^\alpha S^\beta e^L)$ in the limit
$n \to 0$. One can then check that (2.22) agrees with (2.18) from the fact that the
denominator of (2.18) approaches one. We have thus established that (2.20) and
(2.18) represent the same quantity. Similarly we find

\[ m_\alpha = [\langle S_i^\alpha \rangle]. \tag{2.23} \]


The parameter $m$ is the ordinary ferromagnetic order parameter according
to (2.23), and is the value of $m_\alpha$ when the latter is independent of $\alpha$. The other
parameter $q_{\alpha\beta}$ is the spin glass order parameter. This may be understood by
remembering that traces over all replicas other than $\alpha$ and $\beta$ cancel out in the
denominator and numerator in (2.20). One then finds

\[ q_{\alpha\beta} = \left[ \frac{\mathrm{Tr}\, S_i^\alpha e^{-\beta H_\alpha}}{\mathrm{Tr}\, e^{-\beta H_\alpha}} \cdot \frac{\mathrm{Tr}\, S_i^\beta e^{-\beta H_\beta}}{\mathrm{Tr}\, e^{-\beta H_\beta}} \right] = [\langle S_i^\alpha \rangle \langle S_i^\beta \rangle] = [\langle S_i \rangle^2] \equiv q \tag{2.24} \]

if we cannot distinguish one replica from another. In the paramagnetic phase at
high temperatures, $\langle S_i \rangle$ (which is $\langle S_i^\alpha \rangle$ for any single $\alpha$) vanishes at each site $i$
and therefore $m = q = 0$. The ferromagnetic phase has ordering almost uniform
in space, and if we choose that orientation of ordering as the positive direction,
then $\langle S_i \rangle > 0$ at most sites. This implies $m > 0$ and $q > 0$.

If the spin glass phase characteristic of the Edwards–Anderson model or the
SK model exists, the spins in that phase should be randomly frozen. In the spin
glass phase $\langle S_i \rangle$ is not vanishing at any site because the spin does not fluctuate
significantly in time. However, the sign of this expectation value would change
from site to site, and such an apparently random spin pattern does not change
in time. The spin configuration frozen in time is replaced by another frozen
spin configuration for a different set of interactions $J$ since the environment
of a spin changes drastically. Thus the configurational average of $\langle S_i \rangle$ over the
distribution of $J$ corresponds to the average over various environments at a given
spin, which would yield both $\langle S_i \rangle > 0$ and $\langle S_i \rangle < 0$, suggesting the possibility of
$m = [\langle S_i \rangle] = 0$. The spin glass order parameter $q$ is not vanishing in the same
situation because it is the average of a positive quantity $\langle S_i \rangle^2$. Thus there could
exist a phase with $m = 0$ and $q > 0$, which is the spin glass phase with $q$ as the
spin glass order parameter. It will indeed be shown that the equations of state
for the SK model have a solution with $m = 0$ and $q > 0$.


2.3 Replica-symmetric solution 

2.3.1 Equations of state 

It is necessary to have the explicit dependence of $q_{\alpha\beta}$ and $m_\alpha$ on replica indices
$\alpha$ and $\beta$ in order to calculate the free energy and order parameters from (2.17)
to (2.19). Naively, the dependence on these replica indices should not affect the
physics of the system because replicas have been introduced artificially for the
convenience of the configurational average. It therefore seems natural to assume
replica symmetry (RS), $q_{\alpha\beta} = q$ and $m_\alpha = m$ (which we used in the previous
section), and derive the replica-symmetric solution. The free energy (2.17) then
reads, before the limit $n \to 0$ is taken,

\[ -\beta[f] = \frac{\beta^2 J^2}{4n} \{ -n(n-1) q^2 \} - \frac{\beta J_0}{2n} n m^2 + \frac{1}{n} \log \mathrm{Tr}\, e^L + \frac{1}{4}\beta^2 J^2. \tag{2.25} \]


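The third term on the right hand side can be evaluated using the definition of
$L$, (2.15), and a Gaussian integral as

\[ \begin{aligned}
\log \mathrm{Tr}\, e^L & = \log \mathrm{Tr} \sqrt{\frac{\beta^2 J^2 q}{2\pi}} \int dz\, \exp\left( -\frac{\beta^2 J^2 q}{2} z^2 + \beta^2 J^2 q z \sum_\alpha S^\alpha - \frac{n}{2}\beta^2 J^2 q + \beta (J_0 m + h) \sum_\alpha S^\alpha \right) \\
& = \log \int Dz\, \exp\left( n \log 2\cosh(\beta J \sqrt{q} z + \beta J_0 m + \beta h) - \frac{n}{2}\beta^2 J^2 q \right) \\
& = \log\left( 1 + n \int Dz\, \log 2\cosh \beta\tilde{H}(z) - \frac{n}{2}\beta^2 J^2 q + O(n^2) \right).
\end{aligned} \tag{2.26} \]

Here $Dz = dz\, \exp(-z^2/2)/\sqrt{2\pi}$ is the Gaussian measure and $\tilde{H}(z) = J\sqrt{q} z +
J_0 m + h$. Inserting (2.26) into (2.25) and taking the limit $n \to 0$, we have

\[ -\beta[f] = \frac{\beta^2 J^2}{4}(1-q)^2 - \frac{1}{2}\beta J_0 m^2 + \int Dz\, \log 2\cosh \beta\tilde{H}(z). \tag{2.27} \]

The extremization condition of the free energy (2.27) with respect to $m$ is

\[ m = \int Dz\, \tanh \beta\tilde{H}(z). \tag{2.28} \]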


This is the equation of state of the ferromagnetic order parameter $m$ and
corresponds to (2.19) with the trace operation being carried out explicitly. This
operation is performed by inserting $q_{\alpha\beta} = q$ and $m_\alpha = m$ into (2.15) and taking
the trace in the numerator of (2.18). The denominator reduces to one as $n \to 0$. It
is convenient to decompose the double sum over $\alpha$ and $\beta$ by a Gaussian integral.

Comparison of (2.28) with the equation of state for a single spin in a field
$m = \tanh \beta h$ (obtained from (1.10) by setting $J = 0$) suggests the interpretation
that the internal field has a Gaussian distribution due to randomness.

The extremization condition with respect to $q$ is

\[ \frac{\beta^2 J^2}{2}(q - 1) + \int Dz\, (\tanh \beta\tilde{H}(z)) \cdot \frac{\beta J}{2\sqrt{q}} z = 0, \tag{2.29} \]

partial integration of which yields

\[ q = 1 - \int Dz\, \mathrm{sech}^2 \beta\tilde{H}(z) = \int Dz\, \tanh^2 \beta\tilde{H}(z). \tag{2.30} \]
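
Equations (2.28) and (2.30) have to be solved numerically in general. As a hedged illustration, the sketch below treats only the symmetric case $J_0 = 0$, $h = 0$ (so that $m = 0$) and iterates (2.30) to a fixed point. The function names, the starting value and the trapezoid grid are assumptions of this sketch, not a method given in the text.

```python
import math

def gauss_avg(func, zmax=8.0, steps=2000):
    """Trapezoid estimate of the Gaussian average of func(z), z ~ N(0, 1)."""
    h = 2.0 * zmax / steps
    total = 0.0
    for k in range(steps + 1):
        z = -zmax + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * math.exp(-0.5 * z * z) * func(z)
    return total * h / math.sqrt(2.0 * math.pi)

def solve_q(T, J=1.0, q_start=0.5, tol=1e-10, max_iter=10000):
    """Fixed-point iteration of (2.30) with J0 = h = 0:
    q = int Dz tanh^2(beta * J * sqrt(q) * z)."""
    beta = 1.0 / T
    q = q_start
    for _ in range(max_iter):
        scale = beta * J * math.sqrt(q)
        q_new = gauss_avg(lambda z: math.tanh(scale * z) ** 2)
        if abs(q_new - q) < tol:
            return q_new
        q = q_new
    return q

if __name__ == "__main__":
    for T in (1.5, 1.2, 0.8, 0.5, 0.2):
        print("T/J =", T, " q =", solve_q(T))
```

Starting from a non-zero guess, the iteration should flow to $q = 0$ at high temperature and settle at a positive value at low temperature. Convergence slows down close to the transition, so `max_iter` and `tol` may need adjusting there.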


2.3.2 Phase diagram 


The behaviour of the solution of the equations of state (2.28) and (2.30) is
determined by the parameters $\beta$ and $J_0$. For simplicity let us restrict ourselves to
the case without external field, $h = 0$, for the rest of this chapter. If the distribution
of $J_{ij}$ is symmetric ($J_0 = 0$), we have $\tilde{H}(z) = J\sqrt{q} z$ so that $\tanh \beta\tilde{H}(z)$ is an odd
function. Then the magnetization vanishes ($m = 0$) and there is no ferromagnetic
phase. The free energy is

\[ -\beta[f] = \frac{1}{4}\beta^2 J^2 (1-q)^2 + \int Dz\, \log 2\cosh(\beta J \sqrt{q} z). \tag{2.31} \]

To investigate the properties of the system near the critical point where the spin
glass order parameter $q$ is small, it is convenient to expand the right hand side
of (2.31) as

\[ \beta[f] = -\frac{1}{4}\beta^2 J^2 - \log 2 - \frac{\beta^2 J^2}{4}(1 - \beta^2 J^2) q^2 + O(q^3). \tag{2.32} \]

The Landau theory tells us that the critical point is determined by the condition
of vanishing coefficient of the second-order term $q^2$ as we saw in (1.12). Hence
the spin glass transition is concluded to exist at $T = J \equiv T_f$.
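
The expansion (2.32) can be checked in a few lines. The sketch below uses only the Taylor series of $\log\cosh$ and the Gaussian moments $\int Dz\, z^2 = 1$, $\int Dz\, z^4 = 3$; the shorthand $x$ is introduced here for convenience and is not part of the original notation.

```latex
\begin{align*}
x &\equiv \beta^2 J^2 q, \qquad
\log\cosh y = \frac{y^2}{2} - \frac{y^4}{12} + O(y^6), \\
\int Dz\, \log 2\cosh(\sqrt{x}\, z)
  &= \log 2 + \frac{x}{2} - \frac{x^2}{4} + O(x^3), \\
-\beta[f]
  &= \frac{\beta^2 J^2}{4}\,(1 - 2q + q^2) + \log 2
     + \frac{\beta^2 J^2 q}{2} - \frac{\beta^4 J^4 q^2}{4} + O(q^3) \\
  &= \frac{\beta^2 J^2}{4} + \log 2
     + \frac{\beta^2 J^2}{4}\,(1 - \beta^2 J^2)\, q^2 + O(q^3).
\end{align*}
```

The terms linear in $q$ cancel, and the coefficient of $q^2$ changes sign at $\beta J = 1$, that is, at $T = J$.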

It should be noted that the coefficient of $q^2$ in (2.32) is negative if $T > T_f$. This
means that the paramagnetic solution ($q = 0$) at high temperatures maximizes
the free energy. Similarly the spin glass solution $q > 0$ for $T < T_f$ maximizes
the free energy in the low-temperature phase. This pathological behaviour is a
consequence of the replica method in the following sense. As one can see from
(2.25), the coefficient of $q^2$, which represents the number of replica pairs, changes
sign at $n = 1$ and we have a negative number of pairs of replicas in the limit
$n \to 0$, which causes maximization, instead of minimization, of the free energy.
By contrast the coefficient of $m^2$ in (2.25) does not change sign, and the free energy
can be minimized with respect to this order parameter as is usually the case in
statistical mechanics.

A ferromagnetic solution ($m > 0$) may exist if the distribution of $J_{ij}$ is not
symmetric around zero ($J_0 > 0$). Expanding the right hand side of (2.30) and
keeping only the lowest order terms in $q$ and $m$, we have

\[ q = \beta^2 J^2 q + \beta^2 J_0^2 m^2. \tag{2.33} \]

If $J_0 = 0$, the critical point is identified as the temperature where the coefficient
$\beta^2 J^2$ becomes one by the same argument as in §1.3.2. This result agrees with
the conclusion already derived from the expansion of the free energy, $T_f = J$.

Fig. 2.1. Phase diagram of the SK model. The dashed line is the boundary
between the ferromagnetic (F) and spin glass (SG) phases and exists only
under the ansatz of replica symmetry. The dash-dotted lines will be explained
in detail in the next chapter: the replica-symmetric solution is unstable below
the AT line, and a mixed phase (M) emerges between the spin glass and
ferromagnetic phases. The system is in the paramagnetic phase (P) in the
high-temperature region.


If $J_0 > 0$ and $m > 0$, (2.33) implies $q = O(m^2)$. We then expand the right
hand side of the equation of state (2.28) bearing this in mind and keep only the
lowest order term to obtain

\[ m = \beta J_0 m + O(q). \tag{2.34} \]

It has thus been shown that the ferromagnetic critical point, where $m$ starts to
assume a non-vanishing value, is $\beta J_0 = 1$ or $T_c = J_0$.
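
The onset of ferromagnetic order at $T_c = J_0$ can be illustrated numerically. The sketch below is only an illustration under stated assumptions, not the procedure used to draw Fig. 2.1: it iterates (2.28) and (2.30) together from an ordered starting guess. The name `solve_rs`, the starting values and the quadrature grid are hypothetical choices.

```python
import math

def gauss_avg(func, zmax=8.0, steps=2000):
    """Trapezoid estimate of the Gaussian average of func(z), z ~ N(0, 1)."""
    h = 2.0 * zmax / steps
    total = 0.0
    for k in range(steps + 1):
        z = -zmax + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * math.exp(-0.5 * z * z) * func(z)
    return total * h / math.sqrt(2.0 * math.pi)

def solve_rs(T, J0, J=1.0, h=0.0, tol=1e-10, max_iter=20000):
    """Iterate the replica-symmetric equations of state (2.28) and (2.30).
    Returns (m, q). The starting point m = q = 0.5 is an arbitrary ordered guess."""
    beta = 1.0 / T
    m, q = 0.5, 0.5
    for _ in range(max_iter):
        a = beta * J * math.sqrt(q)      # coefficient of z in beta * H(z)
        b = beta * (J0 * m + h)          # constant part of beta * H(z)
        m_new = gauss_avg(lambda z: math.tanh(a * z + b))
        q_new = gauss_avg(lambda z: math.tanh(a * z + b) ** 2)
        if abs(m_new - m) < tol and abs(q_new - q) < tol:
            return m_new, q_new
        m, q = m_new, q_new
    return m, q

if __name__ == "__main__":
    for T in (2.0, 1.6, 1.4, 1.0):
        print("T/J =", T, " (m, q) =", solve_rs(T, J0=1.5))
```

With $J_0 = 1.5J$ the iteration should return $m = q = 0$ above $T = J_0$ and a solution with $m > 0$ below it. Near $T_c$ the convergence is slow, as expected from the linearized equation (2.34).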

We have so far derived the boundaries between the paramagnetic and spin
glass phases and between the paramagnetic and ferromagnetic phases. The
boundary between the spin glass and ferromagnetic phases is given only by numerically
solving (2.28) and (2.30). Figure 2.1 is the phase diagram thus obtained. The
spin glass phase ($q > 0$, $m = 0$) exists as long as $J_0$ is smaller than $J$. This spin
glass phase extends below the ferromagnetic phase in the range $J_0 > J$, which is
called the re-entrant transition. The dashed line in Fig. 2.1 is the phase
boundary between the spin glass and ferromagnetic phases representing the re-entrant
transition. As explained in the next chapter, this dashed line actually disappears
if we take into account the effects of replica symmetry breaking. Instead, the ver- 


ing. The properties of the mixed phase will be explained in the next chapter. 
2.3.3 Negative entropy


Failure of the assumption of replica symmetry at low temperatures manifests
itself in the negative value $-1/2\pi$ of the ground-state entropy for $J_0 = 0$. This
is a clear inconsistency for the Ising model with discrete degrees of freedom. To
verify this result from (2.31), we first derive the low-temperature form of the
spin glass order parameter $q$. According to (2.30), $q$ tends to one as $T \to 0$. We
thus assume $q = 1 - aT$ ($a > 0$) for $T$ very small and check the consistency of
this linear form. The $q$ in $\mathrm{sech}^2 \beta\tilde{H}(z)$ of (2.30) can be approximated by one to
leading order. Then we have, for $\beta \to \infty$,

\[ \begin{aligned}
\int Dz\, \mathrm{sech}^2 \beta J z & = \frac{1}{\beta J} \int Dz\, \frac{d}{dz} \tanh \beta J z \\
& \to \frac{1}{\beta J} \int Dz\, \{2\delta(z)\} = \sqrt{\frac{2}{\pi}}\, \frac{T}{J}.
\end{aligned} \tag{2.35} \]

This result confirms the consistency of the assumption $q = 1 - aT$ with $a =
\sqrt{2/\pi}/J$.

To obtain the ground-state entropy, it is necessary to investigate the
behaviour of the first term on the right hand side of (2.31) in the limit $T \to 0$.
Substitution of $q = 1 - aT$ into this term readily leads to the contribution $-T/2\pi$
to the free energy. The second term, the integral of $\log 2\cosh \beta\tilde{H}(z)$, is evaluated
by separating the integration range into positive and negative parts. These two
parts actually give the same value, and it is sufficient to calculate one of them
and multiply the result by the factor 2. The integral for large $\beta$ is then

\[ 2\int_0^\infty Dz\, \left\{ \beta J \sqrt{q} z + \log(1 + e^{-2\beta J \sqrt{q} z}) \right\} \approx \frac{2\beta J (1 - aT/2)}{\sqrt{2\pi}} + 2\int_0^\infty Dz\, e^{-2\beta J \sqrt{q} z}. \tag{2.36} \]

The second term can be shown to be of $O(T^2)$, and we may neglect it in our
evaluation of the ground-state entropy. The first term contributes $-\sqrt{2/\pi}\,J + T/\pi$
to the free energy. The free energy in the low-temperature limit therefore behaves
as

\[ [f] \approx -\sqrt{\frac{2}{\pi}}\, J + \frac{T}{2\pi}, \tag{2.37} \]

from which we conclude that the ground-state entropy is $-1/2\pi$ and the
ground-state energy is $-\sqrt{2/\pi}\,J$.
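
The low-temperature behaviour (2.37) can be examined numerically. The sketch below is an illustration under stated assumptions: it solves (2.30) with $J_0 = h = 0$ by fixed-point iteration, evaluates (2.31), and estimates the entropy $-\partial[f]/\partial T$ by a central finite difference. All function names, the step `dT` and the quadrature grid are choices made for this example.

```python
import math

def gauss_avg(func, zmax=8.0, steps=8000):
    """Trapezoid estimate of the Gaussian average of func(z), z ~ N(0, 1)."""
    h = 2.0 * zmax / steps
    total = 0.0
    for k in range(steps + 1):
        z = -zmax + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * math.exp(-0.5 * z * z) * func(z)
    return total * h / math.sqrt(2.0 * math.pi)

def log2cosh(x):
    """log(2 cosh x), written so that large |x| does not overflow."""
    ax = abs(x)
    return ax + math.log1p(math.exp(-2.0 * ax))

def solve_q(T, J=1.0, q_start=0.9, tol=1e-12, max_iter=10000):
    """Fixed-point iteration of (2.30) with J0 = h = 0."""
    beta = 1.0 / T
    q = q_start
    for _ in range(max_iter):
        scale = beta * J * math.sqrt(q)
        q_new = gauss_avg(lambda z: math.tanh(scale * z) ** 2)
        if abs(q_new - q) < tol:
            return q_new
        q = q_new
    return q

def rs_free_energy(T, J=1.0):
    """Replica-symmetric [f] from (2.31), evaluated at the solution of (2.30)."""
    beta = 1.0 / T
    q = solve_q(T, J)
    scale = beta * J * math.sqrt(q)
    minus_beta_f = (0.25 * (beta * J) ** 2 * (1.0 - q) ** 2
                    + gauss_avg(lambda z: log2cosh(scale * z)))
    return -T * minus_beta_f

def rs_entropy(T, J=1.0, dT=0.005):
    """Entropy -d[f]/dT estimated by a central finite difference."""
    return -(rs_free_energy(T + dT, J) - rs_free_energy(T - dT, J)) / (2.0 * dT)

if __name__ == "__main__":
    for T in (0.2, 0.1, 0.05):
        print("T/J =", T, " [f] =", rs_free_energy(T), " entropy =", rs_entropy(T))
    print("-1/(2 pi) =", -1.0 / (2.0 * math.pi))
```

At the lowest temperatures the estimated entropy should be negative and drift towards $-1/2\pi$, while $[f]$ approaches $-\sqrt{2/\pi}\,J$. The grid is kept fine because the integrands vary on the scale $z \sim T/J$ near the origin.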

It was suspected at an early stage of research that this negative entropy
might have been caused by the inappropriate exchange of limits $n \to 0$ and
$N \to \infty$ in deriving (2.17). The correct order is $N \to \infty$ after $n \to 0$, but we
took the limit $N \to \infty$ first so that the method of steepest descent is applicable.
However, it has now been established that the assumption of replica symmetry
$q_{\alpha\beta} = q$ ($\forall(\alpha\beta); \alpha \ne \beta$) is the real source of the trouble. We shall study this
problem in the next chapter.


Bibliographical note


Developments following the pioneering contributions of Edwards and Anderson 
(1975) and Sherrington and Kirkpatrick (1975) up to the mid 1980s are sum- 
marized in Mézard et al. (1987), Binder and Young (1986), Fischer and Hertz 
(1991), and van Hemmen and Morgenstern (1987). See also the arguments and 
references in the next chapter. 
Let us continue our analysis of the SK model. The free energy of the SK model
derived under the ansatz of replica symmetry has the problem of negative
entropy at low temperatures. It is therefore natural to investigate the possibility
that the order parameter $q_{\alpha\beta}$ may assume various values depending upon the
replica indices $\alpha$ and $\beta$, and possibly the $\alpha$-dependence of $m_\alpha$. The theory of
replica symmetry breaking started in this way as a mathematical effort to avoid
unphysical conclusions of the replica-symmetric solution. It turned out, however,
that the scheme of replica symmetry breaking has a very rich physical
implication, namely the existence of a vast variety of stable states with ultrametric
structure in the phase space. The present chapter is devoted to the elucidation
of this story.


3.1 Stability of replica-symmetric solution 


It was shown in the previous chapter that the replica-symmetric solution of the 
SK model has a spin glass phase with negative entropy at low temperatures. 
We now test the appropriateness of the assumption of replica symmetry from a 
different point of view. 

A necessary condition for the replica-symmetric solution to be reliable is that
the free energy is stable for infinitesimal deviations from that solution. To check
such a stability, we expand the exponent appearing in the calculation of the
partition function (2.16) to second order in $(q_{\alpha\beta} - q)$ and $(m_\alpha - m)$, deviations
from the replica-symmetric solution, as

\[ \int \prod_\alpha dm_\alpha \prod_{\alpha<\beta} dq_{\alpha\beta}\, \exp\left[ -\beta N \left\{ f_{\mathrm{RS}} + (\text{quadratic in } (q_{\alpha\beta} - q) \text{ and } (m_\alpha - m)) \right\} \right], \tag{3.1} \]

where $f_{\mathrm{RS}}$ is the replica-symmetric free energy. This integral should not diverge
in the limit $N \to \infty$ and thus the quadratic form must be positive definite (or
at least positive semi-definite). We show in the present section that this stability
condition of the replica-symmetric solution is not satisfied in the region below a
line, called the de Almeida–Thouless (AT) line, in the phase diagram (de Almeida
and Thouless 1978). The explicit form of the solution with replica symmetry
breaking below the AT line and its physical significance will be discussed in
subsequent sections.


3.1.1 Hessian


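We restrict ourselves to the case $h = 0$ unless stated otherwise. It is convenient
to rescale the variables as

\[ \beta J q_{\alpha\beta} = y^{\alpha\beta}, \qquad \sqrt{\beta J_0}\, m_\alpha = x^\alpha. \tag{3.2} \]

Then the free energy is, from (2.17),

\[ [f] = -\frac{\beta J^2}{4} - \lim_{n\to 0} \frac{1}{\beta n} \left\{ -\sum_{\alpha<\beta} \frac{1}{2} (y^{\alpha\beta})^2 - \sum_\alpha \frac{1}{2} (x^\alpha)^2 + \log \mathrm{Tr} \exp\left( \beta J \sum_{\alpha<\beta} y^{\alpha\beta} S^\alpha S^\beta + \sqrt{\beta J_0} \sum_\alpha x^\alpha S^\alpha \right) \right\}. \tag{3.3} \]

Let us expand $[f]$ to second order in small deviations around the replica-symmetric
solution to check the stability,

\[ x^\alpha = x + \epsilon^\alpha, \qquad y^{\alpha\beta} = y + \eta^{\alpha\beta}. \tag{3.4} \]

The final term of (3.3) is expanded to second order in $\epsilon^\alpha$ and $\eta^{\alpha\beta}$ as, with the
notation $L_0 = \beta J y \sum_{\alpha<\beta} S^\alpha S^\beta + \sqrt{\beta J_0}\, x \sum_\alpha S^\alpha$,

\[ \begin{aligned}
& \log \mathrm{Tr} \exp\left( L_0 + \beta J \sum_{\alpha<\beta} \eta^{\alpha\beta} S^\alpha S^\beta + \sqrt{\beta J_0} \sum_\alpha \epsilon^\alpha S^\alpha \right) \\
& \quad \approx \log \mathrm{Tr}\, e^{L_0} + \frac{\beta J_0}{2} \sum_{\alpha\beta} \epsilon^\alpha \epsilon^\beta \langle S^\alpha S^\beta \rangle_{L_0} + \frac{\beta^2 J^2}{2} \sum_{\alpha<\beta} \sum_{\gamma<\delta} \eta^{\alpha\beta} \eta^{\gamma\delta} \langle S^\alpha S^\beta S^\gamma S^\delta \rangle_{L_0} \\
& \qquad - \frac{\beta J_0}{2} \sum_{\alpha\beta} \epsilon^\alpha \epsilon^\beta \langle S^\alpha \rangle_{L_0} \langle S^\beta \rangle_{L_0} - \frac{\beta^2 J^2}{2} \sum_{\alpha<\beta} \sum_{\gamma<\delta} \eta^{\alpha\beta} \eta^{\gamma\delta} \langle S^\alpha S^\beta \rangle_{L_0} \langle S^\gamma S^\delta \rangle_{L_0} \\
& \qquad - \beta J \sqrt{\beta J_0} \sum_\delta \sum_{\alpha<\beta} \epsilon^\delta \eta^{\alpha\beta} \langle S^\delta \rangle_{L_0} \langle S^\alpha S^\beta \rangle_{L_0} \\
& \qquad + \beta J \sqrt{\beta J_0} \sum_\delta \sum_{\alpha<\beta} \epsilon^\delta \eta^{\alpha\beta} \langle S^\delta S^\alpha S^\beta \rangle_{L_0}.
\end{aligned} \tag{3.5} \]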


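Here $\langle\cdots\rangle_{L_0}$ denotes the average by the replica-symmetric weight $e^{L_0}$. We have
used the facts that the replica-symmetric solution extremizes (3.3) (so that the
terms linear in $\epsilon^\alpha$ and $\eta^{\alpha\beta}$ vanish) and that $\mathrm{Tr}\, e^{L_0} \to 1$ as $n \to 0$ as explained
in §2.3.1. We see that the second-order term of $[f]$ with respect to $\epsilon^\alpha$ and $\eta^{\alpha\beta}$
is, taking the first and second terms in the braces $\{\cdots\}$ in (3.3) into account,

\[ \begin{aligned}
\Delta = {} & \frac{1}{2} \sum_{\alpha\beta} \left\{ \delta_{\alpha\beta} - \beta J_0 \left( \langle S^\alpha S^\beta \rangle_{L_0} - \langle S^\alpha \rangle_{L_0} \langle S^\beta \rangle_{L_0} \right) \right\} \epsilon^\alpha \epsilon^\beta \\
& + \beta J \sqrt{\beta J_0} \sum_\delta \sum_{\alpha<\beta} \left( \langle S^\delta \rangle_{L_0} \langle S^\alpha S^\beta \rangle_{L_0} - \langle S^\alpha S^\beta S^\delta \rangle_{L_0} \right) \epsilon^\delta \eta^{\alpha\beta} \\
& + \frac{1}{2} \sum_{\alpha<\beta} \sum_{\gamma<\delta} \left\{ \delta_{(\alpha\beta)(\gamma\delta)} - \beta^2 J^2 \left( \langle S^\alpha S^\beta S^\gamma S^\delta \rangle_{L_0} - \langle S^\alpha S^\beta \rangle_{L_0} \langle S^\gamma S^\delta \rangle_{L_0} \right) \right\} \eta^{\alpha\beta} \eta^{\gamma\delta}
\end{aligned} \tag{3.6} \]

up to the trivial factor of $\beta n$ (which is irrelevant to the sign). We denote the
matrix of coefficients of this quadratic form in $\epsilon^\alpha$ and $\eta^{\alpha\beta}$ by $G$, which is called
the Hessian matrix. Stability of the replica-symmetric solution requires that the
eigenvalues of $G$ all be positive.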

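To derive the eigenvalues, let us list the matrix elements of $G$. Since $\langle\cdots\rangle_{L_0}$
represents the average by weight of the replica-symmetric solution, the coefficient
of the second-order terms in $\epsilon$ has only two types of values. To simplify the
notation we omit the suffix $L_0$ in the present section.

\[ G_{\alpha\alpha} = 1 - \beta J_0 (1 - \langle S^\alpha \rangle^2) \equiv A \tag{3.7} \]
\[ G_{\alpha\beta} = -\beta J_0 (\langle S^\alpha S^\beta \rangle - \langle S^\alpha \rangle^2) \equiv B. \tag{3.8} \]

The coefficients of the second-order term in $\eta$ have three different values, the
diagonal and two types of off-diagonal elements. One of the off-diagonal elements
has a matched single replica index and the other has all indices different:

\[ G_{(\alpha\beta)(\alpha\beta)} = 1 - \beta^2 J^2 (1 - \langle S^\alpha S^\beta \rangle^2) \equiv P \tag{3.9} \]
\[ G_{(\alpha\beta)(\alpha\gamma)} = -\beta^2 J^2 (\langle S^\beta S^\gamma \rangle - \langle S^\alpha S^\beta \rangle^2) \equiv Q \tag{3.10} \]
\[ G_{(\alpha\beta)(\gamma\delta)} = -\beta^2 J^2 (\langle S^\alpha S^\beta S^\gamma S^\delta \rangle - \langle S^\alpha S^\beta \rangle^2) \equiv R. \tag{3.11} \]

Finally there are two kinds of cross-terms in $\epsilon$ and $\eta$:

\[ G_{\alpha(\alpha\beta)} = \beta J \sqrt{\beta J_0}\, (\langle S^\alpha \rangle \langle S^\alpha S^\beta \rangle - \langle S^\beta \rangle) \equiv C \tag{3.12} \]
\[ G_{\gamma(\alpha\beta)} = \beta J \sqrt{\beta J_0}\, (\langle S^\gamma \rangle \langle S^\alpha S^\beta \rangle - \langle S^\alpha S^\beta S^\gamma \rangle) \equiv D. \tag{3.13} \]

These complete the elements of $G$.

The expectation values appearing in (3.7) to (3.13) can be evaluated from the
replica-symmetric solution. The elements of $G$ are written in terms of $\langle S^\alpha \rangle = m$
and $\langle S^\alpha S^\beta \rangle = q$ satisfying (2.28) and (2.30) as well as

\[ \langle S^\alpha S^\beta S^\gamma \rangle \equiv t = \int Dz\, \tanh^3 \beta\tilde{H}(z) \tag{3.14} \]
\[ \langle S^\alpha S^\beta S^\gamma S^\delta \rangle \equiv r = \int Dz\, \tanh^4 \beta\tilde{H}(z). \tag{3.15} \]

The integrals on the right of (3.14) and (3.15) can be derived by the method in
§2.3.1.


3.1.2 Eigenvalues of the Hessian and the AT line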


We start the analysis of stability with the simplest case, the paramagnetic solution.
All order parameters $m, q, r$, and $t$ vanish in the paramagnetic phase. Hence
$B, Q, R, C$, and $D$ (the off-diagonal elements of $G$) are all zero. The stability
condition for infinitesimal deviations of the ferromagnetic order parameter $\epsilon^\alpha$ is
$A > 0$, which is equivalent to $1 - \beta J_0 > 0$ or $T > J_0$ from (3.7). Similarly the
stability for spin-glass-like infinitesimal deviations $\eta^{\alpha\beta}$ is $P > 0$ or $T > J$. These
two conditions precisely agree with the region of existence of the paramagnetic
phase derived in §2.3.2 (see Fig. 2.1). Therefore the replica-symmetric solution
is stable in the paramagnetic phase.

It is a more elaborate task to investigate the stability condition of the ordered 
phases. It is necessary to calculate all eigenvalues of the Hessian. Details are given 
in Appendix A, and we just mention the results here. 

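Let us write the eigenvalue equation in the form

\[ G\mu = \lambda\mu, \qquad \mu = \begin{pmatrix} \{\epsilon^\alpha\} \\ \{\eta^{\alpha\beta}\} \end{pmatrix}. \tag{3.16} \]

The symbol $\{\epsilon^\alpha\}$ denotes a column from $\epsilon^1$ at the top to $\epsilon^n$ at the bottom, and
$\{\eta^{\alpha\beta}\}$ is for $\eta^{12}$ to $\eta^{n-1,n}$.

The first eigenvector $\mu_1$ has $\epsilon^\alpha = a$ and $\eta^{\alpha\beta} = b$, uniform in both parts. Its
eigenvalue is, in the limit $n \to 0$,

\[ \lambda_1 = \frac{1}{2} \left\{ A - B + P - 4Q + 3R \pm \sqrt{(A - B - P + 4Q - 3R)^2 - 8(C - D)^2} \right\}. \tag{3.17} \]

The second eigenvector $\mu_2$ has $\epsilon^\theta = a$ for a specific replica $\theta$ and $\epsilon^\alpha = b$ otherwise,
and $\eta^{\alpha\beta} = c$ when $\alpha$ or $\beta$ is equal to $\theta$ and $\eta^{\alpha\beta} = d$ otherwise. The eigenvalue of
this eigenvector becomes degenerate with $\lambda_1$ in the limit $n \to 0$. The third and
final eigenvector $\mu_3$ has $\epsilon^\theta = a$, $\epsilon^\nu = a$ for two specific replicas $\theta, \nu$ and $\epsilon^\alpha = b$
otherwise, and $\eta^{\theta\nu} = c$, $\eta^{\theta\alpha} = \eta^{\nu\alpha} = d$ and $\eta^{\alpha\beta} = e$ otherwise. Its eigenvalue is

\[ \lambda_3 = P - 2Q + R. \tag{3.18} \]

A sufficient condition for $\lambda_1, \lambda_2 > 0$ is, from (3.17),

\[ A - B = 1 - \beta J_0 (1 - q) > 0, \qquad P - 4Q + 3R = 1 - \beta^2 J^2 (1 - 4q + 3r) > 0. \tag{3.19} \]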


These two conditions are seen to be equivalent to the saddle-point condition
of the replica-symmetric free energy (2.27) with respect to $m$ and $q$ as can be
verified by the second-order derivatives:

\[ A - B = \frac{1}{J_0} \left. \frac{\partial^2 [f]}{\partial m^2} \right|_{\mathrm{RS}} > 0, \qquad P - 4Q + 3R = -\frac{2}{\beta J^2} \left. \frac{\partial^2 [f]}{\partial q^2} \right|_{\mathrm{RS}} > 0. \tag{3.20} \]

These inequalities always hold as has been mentioned in §2.3.2.
[Figure 3.1: horizontal axis $T/J$; the regions are labelled RS and RSB.]

Fig. 3.1. Stability limit of the replica-symmetric (RS) solution in the h-T phase
diagram (the AT line) below which replica symmetry breaking (RSB) occurs.


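The condition for positive $\lambda_3$ is

\[ P - 2Q + R = 1 - \beta^2 J^2 (1 - 2q + r) > 0 \tag{3.21} \]

or more explicitly

\[ \left( \frac{T}{J} \right)^2 > \int Dz\, \mathrm{sech}^4 (\beta J \sqrt{q} z + \beta J_0 m). \tag{3.22} \]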


By numerically solving the equations of state of the replica-symmetric order pa- 
rameters (2.28) and (2.30), one finds that the stability condition (3.22) is not 
satisfied in the spin glass and mixed phases in Fig. 2.1. The line of the limit of 
stability within the ferromagnetic phase (i.e. the boundary between the ferromag- 
netic and mixed phases) is the AT line. The mixed phase has finite ferromagnetic 
order but replica symmetry is broken there. More elaborate analysis is required 
in the mixed phase as shown in the next section. 

The stability of replica symmetry in the case of finite $h$ with symmetric
distribution $J_0 = 0$ can be studied similarly. Let us just mention the conclusion
that the stability condition in such a case is given simply by replacing $J_0 m$ by $h$
in (3.22). The phase diagram thus obtained is depicted in Fig. 3.1. A phase with
broken replica symmetry extends into the low-temperature region. This phase is
also often called the spin glass phase. The limit of stability in the present case
is also termed the AT line.
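
The stability condition (3.22) with $J_0 m$ replaced by $h$ is straightforward to evaluate numerically. The sketch below is illustrative only: it assumes $J_0 = 0$, solves (2.30) in a field by fixed-point iteration, and returns the replicon eigenvalue (3.21) written as $1 - \beta^2 J^2 \int Dz\, \mathrm{sech}^4(\beta J \sqrt{q} z + \beta h)$. The function names and the grid are assumptions of this example.

```python
import math

def gauss_avg(func, zmax=8.0, steps=4000):
    """Trapezoid estimate of the Gaussian average of func(z), z ~ N(0, 1)."""
    h = 2.0 * zmax / steps
    total = 0.0
    for k in range(steps + 1):
        z = -zmax + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * math.exp(-0.5 * z * z) * func(z)
    return total * h / math.sqrt(2.0 * math.pi)

def sech(x):
    """1 / cosh(x), written so that large |x| does not overflow."""
    e = math.exp(-abs(x))
    return 2.0 * e / (1.0 + e * e)

def solve_q_field(T, h, J=1.0, q_start=0.5, tol=1e-10, max_iter=10000):
    """Fixed-point iteration of (2.30) with J0 = 0 and external field h."""
    beta = 1.0 / T
    q = q_start
    for _ in range(max_iter):
        a, b = beta * J * math.sqrt(q), beta * h
        q_new = gauss_avg(lambda z: math.tanh(a * z + b) ** 2)
        if abs(q_new - q) < tol:
            return q_new
        q = q_new
    return q

def replicon(T, h, J=1.0):
    """Replicon eigenvalue P - 2Q + R of (3.21) for J0 = 0; positive means RS is stable."""
    beta = 1.0 / T
    q = solve_q_field(T, h, J)
    a, b = beta * J * math.sqrt(q), beta * h
    return 1.0 - (beta * J) ** 2 * gauss_avg(lambda z: sech(a * z + b) ** 4)

if __name__ == "__main__":
    for T, h in ((1.0, 0.5), (0.6, 0.5), (0.2, 0.1)):
        print("T/J =", T, " h/J =", h, " replicon =", replicon(T, h))
```

Scanning $T$ at fixed $h$ and locating the sign change of `replicon` gives a numerical estimate of the AT line in the h-T plane.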


3.2 Replica symmetry breaking 

The third eigenvector $\mu_3$, which causes replica symmetry breaking, is called the
replicon mode. There is no replica symmetry breaking in $m_\alpha$ since the replicon
mode has $a = b$ for $\epsilon^\theta$ and $\epsilon^\nu$ in the limit $n \to 0$, as in the relation (A.19) or
(A.21) in Appendix A. Only $q_{\alpha\beta}$ shows dependence on $\alpha$ and $\beta$. It is necessary to
clarify how $q_{\alpha\beta}$ depends on $\alpha$ and $\beta$, but unfortunately we are not aware of any
first-principle argument which can lead to the exact solution. One thus proceeds
by trial and error to check if the tentative solution satisfies various necessary
conditions for the correct solution, such as positive entropy at low temperatures
and the non-negative eigenvalue of the replicon mode.

The only solution found so far that satisfies all necessary conditions is the one
by Parisi (1979, 1980). The Parisi solution is believed to be the exact solution of
the SK model also because of its rich physical implications. The replica symmetry
is broken in multiple steps in the Parisi solution of the SK model. We shall explain
mainly its first step in the present section.


3.2.1 Parisi solution 


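Let us regard $q_{\alpha\beta}$ ($\alpha \ne \beta$) of the replica-symmetric solution of the SK model
as an element of an $n \times n$ matrix. Then all the elements except those along the
diagonal have the common value $q$, and we may write

\[ \{q_{\alpha\beta}\} = \begin{pmatrix} 0 & & q \\ & \ddots & \\ q & & 0 \end{pmatrix}. \tag{3.23} \]

In the first step of replica symmetry breaking (1RSB), one introduces a
positive integer $m_1$ ($\le n$) and divides the replicas into $n/m_1$ blocks. Off-diagonal
blocks have $q_0$ as their elements and diagonal blocks are assigned $q_1$. All diagonal
elements are kept 0. The following example is for the case of $n = 6$, $m_1 = 3$.

\[ \{q_{\alpha\beta}\} = \left( \begin{array}{ccc|ccc}
0 & q_1 & q_1 & q_0 & q_0 & q_0 \\
q_1 & 0 & q_1 & q_0 & q_0 & q_0 \\
q_1 & q_1 & 0 & q_0 & q_0 & q_0 \\ \hline
q_0 & q_0 & q_0 & 0 & q_1 & q_1 \\
q_0 & q_0 & q_0 & q_1 & 0 & q_1 \\
q_0 & q_0 & q_0 & q_1 & q_1 & 0
\end{array} \right). \tag{3.24} \]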

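In the second step, the off-diagonal blocks are left untouched and the diagonal
blocks are further divided into $m_1/m_2$ blocks. The elements of the innermost
diagonal blocks are assumed to be $q_2$ and all the other elements of the larger diagonal
blocks are kept as $q_1$. For example, if we have $n = 12$, $m_1 = 6$, $m_2 = 3$,

\[ \{q_{\alpha\beta}\} = \begin{pmatrix}
Q_2 & Q_1 & Q_0 & Q_0 \\
Q_1 & Q_2 & Q_0 & Q_0 \\
Q_0 & Q_0 & Q_2 & Q_1 \\
Q_0 & Q_0 & Q_1 & Q_2
\end{pmatrix}, \qquad
Q_2 = \begin{pmatrix} 0 & q_2 & q_2 \\ q_2 & 0 & q_2 \\ q_2 & q_2 & 0 \end{pmatrix}, \tag{3.25} \]

where $Q_0$ and $Q_1$ denote the $3 \times 3$ blocks with all elements equal to $q_0$ and $q_1$,
respectively. The numbers $n, m_1, m_2, \ldots$ are integers by definition and are ordered as
$n \ge m_1 \ge m_2 \ge \cdots \ge 1$.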
Now we define the function $q(x)$ as

\[ q(x) = q_i \qquad (m_{i+1} < x \le m_i) \tag{3.26} \]

and take the limit $n \to 0$ following the prescription of the replica method. We
somewhat arbitrarily reverse the above inequalities

\[ 0 \le m_1 \le m_2 \le \cdots \le 1 \qquad (0 \le x \le 1) \tag{3.27} \]

and suppose that $q(x)$ becomes a continuous function defined between 0 and 1.
This is the basic idea of the Parisi solution.


3.2.2 First-step RSB 


We derive expressions of the physical quantities by the first-step RSB (1RSB)
represented in (3.24). The first term on the right hand side of the single-body
effective Hamiltonian (2.15) reduces to

\[ \sum_{\alpha<\beta} q_{\alpha\beta} S^\alpha S^\beta = \frac{1}{2} \left\{ q_0 \left( \sum_{\alpha}^{n} S^\alpha \right)^2 + (q_1 - q_0) \sum_{\mathrm{block}}^{n/m_1} \left( \sum_{\alpha \in \mathrm{block}}^{m_1} S^\alpha \right)^2 - n q_1 \right\}. \tag{3.28} \]

The first term on the right hand side here fills all elements of the matrix $\{q_{\alpha\beta}\}$
with $q_0$ but the block-diagonal part is replaced with $q_1$ by the second term. The
last term forces the diagonal elements to zero. Similarly the quadratic term of
$q_{\alpha\beta}$ in the free energy (2.17) is

\[ \lim_{n\to 0} \frac{1}{n} \sum_{\alpha\ne\beta} q_{\alpha\beta}^2 = \lim_{n\to 0} \frac{1}{n} \left\{ n^2 q_0^2 + \frac{n}{m_1} m_1^2 (q_1^2 - q_0^2) - n q_1^2 \right\} = (m_1 - 1) q_1^2 - m_1 q_0^2. \tag{3.29} \]
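
Both (3.28) and the finite-$n$ expression inside the braces of (3.29) are exact identities for integer $n$ and $m_1$, so they can be checked by brute force. The following sketch builds the 1RSB matrix of (3.24) and compares both sides; the function names and the test values of $q_0$ and $q_1$ are arbitrary choices for this illustration.

```python
import itertools

def parisi_1rsb(n, m1, q0, q1):
    """n x n 1RSB matrix as in (3.24): 0 on the diagonal, q1 inside the
    diagonal blocks of size m1, q0 in the off-diagonal blocks."""
    if n % m1 != 0:
        raise ValueError("m1 must divide n")
    return [[0.0 if a == b else (q1 if a // m1 == b // m1 else q0)
             for b in range(n)] for a in range(n)]

def lhs_328(Q, S):
    """Left hand side of (3.28): sum over alpha < beta of q_ab * S_a * S_b."""
    n = len(S)
    return sum(Q[a][b] * S[a] * S[b] for a in range(n) for b in range(a + 1, n))

def rhs_328(n, m1, q0, q1, S):
    """Right hand side of (3.28), using block sums of the replicated spins."""
    total_sq = sum(S) ** 2
    block_sq = sum(sum(S[k:k + m1]) ** 2 for k in range(0, n, m1))
    return 0.5 * (q0 * total_sq + (q1 - q0) * block_sq - n * q1)

def max_error_328(n, m1, q0, q1):
    """Largest mismatch between the two sides of (3.28) over all 2^n spin states."""
    Q = parisi_1rsb(n, m1, q0, q1)
    return max(abs(lhs_328(Q, S) - rhs_328(n, m1, q0, q1, S))
               for S in itertools.product((-1, 1), repeat=n))

def sum_sq_offdiag(Q):
    """Sum of q_ab^2 over alpha != beta, computed directly from the matrix."""
    n = len(Q)
    return sum(Q[a][b] ** 2 for a in range(n) for b in range(n) if a != b)

def braces_329(n, m1, q0, q1):
    """Finite-n expression in the braces of (3.29)."""
    return n * n * q0 ** 2 + (n // m1) * m1 ** 2 * (q1 ** 2 - q0 ** 2) - n * q1 ** 2

if __name__ == "__main__":
    n, m1, q0, q1 = 6, 3, 0.2, 0.7
    print("max error in (3.28):", max_error_328(n, m1, q0, q1))
    print("direct sum:", sum_sq_offdiag(parisi_1rsb(n, m1, q0, q1)),
          " formula:", braces_329(n, m1, q0, q1))
```

The check is restricted to integer $n$ divisible by $m_1$; the limit $n \to 0$ with $0 \le m_1 \le 1$ is an analytic continuation that such a finite check cannot probe.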
We insert (3.28) and (3.29) into (2.17) and linearize $(\sum_\alpha S^\alpha)^2$ in (3.28) by a
Gaussian integral in a similar manner as in the replica-symmetric calculations.
It is necessary to introduce $1 + n/m_1$ Gaussian variables corresponding to the
number of terms of the form $(\sum_\alpha S^\alpha)^2$ in (3.28). Finally we take the limit $n \to 0$
to find the free energy with 1RSB as

\[ \beta f_{\mathrm{1RSB}} = \frac{\beta^2 J^2}{4} \left\{ (m_1 - 1) q_1^2 - m_1 q_0^2 + 2 q_1 - 1 \right\} + \frac{\beta J_0}{2} m^2 - \log 2 - \frac{1}{m_1} \int Du\, \log \int Dv\, \cosh^{m_1} \Xi \tag{3.30} \]
\[ \Xi = \beta \left( J \sqrt{q_0}\, u + J \sqrt{q_1 - q_0}\, v + J_0 m + h \right). \tag{3.31} \]

Here we have used the replica symmetry of magnetization $m = m_\alpha$.

The variational parameters $q_0, q_1, m$, and $m_1$ all fall in the range between 0
and 1. The variational (extremization) conditions of (3.30) with respect to $m, q_0$,
and $q_1$ lead to the equations of state:

\[ m = \int Du\, \frac{\int Dv\, \cosh^{m_1} \Xi \tanh \Xi}{\int Dv\, \cosh^{m_1} \Xi} \tag{3.32} \]
\[ q_0 = \int Du \left( \frac{\int Dv\, \cosh^{m_1} \Xi \tanh \Xi}{\int Dv\, \cosh^{m_1} \Xi} \right)^2 \tag{3.33} \]
\[ q_1 = \int Du\, \frac{\int Dv\, \cosh^{m_1} \Xi \tanh^2 \Xi}{\int Dv\, \cosh^{m_1} \Xi}. \tag{3.34} \]

Comparison of these equations of state for the order parameters with those for
the replica-symmetric solution (2.28) and (2.30) suggests the following
interpretation. In (3.32) for the magnetization, the integrand after $Du$
represents magnetization within a block of the 1RSB matrix (3.24), which is
averaged over all blocks with the Gaussian weight. Analogously, (3.34) for
$q_1$ is the spin glass order parameter within a diagonal block averaged over
all blocks. In (3.33) for $q_0$, on the other hand, one first calculates the
magnetization within a block and takes its product between blocks, an
interblock spin glass order parameter. Indeed, if one carries out the trace
operation in the definition of $q_{\alpha\beta}$, (2.18), by taking $\alpha$
and $\beta$ within a single block and assuming 1RSB, one obtains (3.34),
whereas (3.33) results if $\alpha$ and $\beta$ belong to different blocks. The
Schwarz inequality assures $q_1 \ge q_0$.

We omit the explicit form of the extremization condition of the free energy
(3.30) by the parameter $m_1$ since the form is a little complicated and is
not used later.

When $J_0 = h = 0$, $\Xi$ is odd in $u, v$, and thus $m = 0$ is the only
solution of (3.32). The order parameter $q_1$ can be positive for
$T < T_f = J$ because the first term in the expansion of the right hand side
of (3.34) for small $q_0$ and $q_1$ is $\beta^2 J^2 q_1$. Therefore the RS and
1RSB give the same transition point. The parameter $m_1$ is one at $T_f$ and
decreases with temperature.
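The right hand sides of the equations of state (3.32)-(3.34) are double Gaussian integrals and can be evaluated by Gauss-Hermite quadrature. The sketch below is only an illustration under stated assumptions: the function names, the number of quadrature nodes, and the parameter values are hypothetical choices, and $m_1$ is treated as a fixed input rather than being extremized.

```python
import numpy as np

def gauss_measure(n_nodes=64):
    """Nodes x and weights w with sum(w * f(x)) approximating the Gaussian average of f."""
    x, w = np.polynomial.hermite_e.hermegauss(n_nodes)
    return x, w / np.sqrt(2.0 * np.pi)

def one_rsb_rhs(q0, q1, m1, beta, J=1.0, J0=0.0, h=0.0, m=0.0, n_nodes=64):
    """Evaluate the right hand sides of (3.32), (3.33), (3.34) for given parameters."""
    u, wu = gauss_measure(n_nodes)
    v, wv = gauss_measure(n_nodes)
    # Xi of (3.31) on the (u, v) grid; rows are u, columns are v.
    xi = beta * (J * np.sqrt(q0) * u[:, None]
                 + J * np.sqrt(max(q1 - q0, 0.0)) * v[None, :]
                 + J0 * m + h)
    weight = wv[None, :] * np.cosh(xi) ** m1      # Dv cosh^{m1} Xi
    norm = weight.sum(axis=1)
    t1 = (weight * np.tanh(xi)).sum(axis=1) / norm        # block magnetization
    t2 = (weight * np.tanh(xi) ** 2).sum(axis=1) / norm   # intra-block overlap
    return float(wu @ t1), float(wu @ t1 ** 2), float(wu @ t2)

# Illustrative evaluation at J0 = h = 0.
m_new, q0_new, q1_new = one_rsb_rhs(q0=0.2, q1=0.5, m1=0.5, beta=2.0)

# For m1 = 1 the inner ratio reduces to tanh(beta*J*sqrt(q0)*u), the RS form.
u, wu = gauss_measure()
_, q0_m1, _ = one_rsb_rhs(q0=0.3, q1=0.6, m1=1.0, beta=1.5)
q0_rs_form = float(wu @ np.tanh(1.5 * np.sqrt(0.3) * u) ** 2)
```

Repeated substitution of the outputs back into the inputs gives a simple fixed-point iteration for $q_0$ and $q_1$ at given $m_1$; the evaluation above already displays $m = 0$ and $q_1 \ge q_0$.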

3.2.3 Stability of the first-step RSB


The stability of 1RSB can be investigated by a direct generalization of the
argument in §3.1. We mention only a few main points for the case
$J_0 = h = 0$. It is sufficient to treat two cases: one with all indices
$\alpha, \beta, \gamma, \delta$ of the Hessian elements within the same block
and the other with indices in two different blocks.

If $\alpha$ and $\beta$ in $q_{\alpha\beta}$ belong to the same block, the
stability condition of the replicon mode for infinitesimal deviations from
1RSB is expressed as

$$\lambda_3 = P - 2Q + R = 1 - \beta^2 J^2 \int Du\, \frac{\int Dv \cosh^{m_1 - 4}\Xi}{\int Dv \cosh^{m_1}\Xi} > 0. \tag{3.35}$$

For the replicon mode between two different blocks, the stability condition
reads

$$\lambda_3 = P - 2Q + R = 1 - \beta^2 J^2 \int Du \left(\frac{\int Dv \cosh^{m_1 - 1}\Xi}{\int Dv \cosh^{m_1}\Xi}\right)^4 > 0. \tag{3.36}$$


According to the Schwarz inequality, the right hand side of (3.35) is less than
or equal to that of (3.36), and therefore the former is sufficient. Equation
(3.35) is not satisfied in the spin glass phase, similar to the case of the RS
solution. However, the absolute value of the eigenvalue is confirmed by
numerical evaluation to be smaller than that of the RS solution although
$\lambda_3$ is still negative. This suggests an improvement towards a stable
solution. The entropy per spin at $J_0 = 0$, $T = 0$ reduces from $-0.16$
$(= -1/2\pi)$ for the RS solution to $-0.01$ for the 1RSB. Thus we may expect
to obtain still better results if we go further into replica symmetry
breaking.


3.3 Full RSB solution 

Let us proceed with the calculation of the free energy (2.17) by a multiple-step
RSB. We restrict ourselves to the case $J_0 = 0$ for simplicity.

3.3.1 Physical quantities 


The sum involving $q_{\alpha\beta}^2$ in the free energy (2.17) can be
expressed at the $K$th step of RSB ($K$-RSB) as follows by counting the number
of elements in a similar way to the 1RSB case (3.29):

$$\sum_{\alpha\ne\beta} q_{\alpha\beta}^l = q_0^l n^2 + (q_1^l - q_0^l)\, m_1^2 \cdot \frac{n}{m_1} + (q_2^l - q_1^l)\, m_2^2 \cdot \frac{m_1}{m_2}\cdot\frac{n}{m_1} + \cdots - q_K^l \cdot n = n \sum_{j=0}^{K} (m_j - m_{j+1})\, q_j^l, \tag{3.37}$$

where $l$ is an arbitrary integer and $m_0 = n$, $m_{K+1} = 1$. In the limit
$n \to 0$, we may use the replacement $m_j - m_{j+1} \to -dx$ to find

$$\frac{1}{n}\sum_{\alpha\ne\beta} q_{\alpha\beta}^l \to -\int_0^1 q^l(x)\, dx. \tag{3.38}$$
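The counting formula (3.37) holds for any integer $n$ with a nested block structure, so it can be tested directly on an explicit matrix. The sketch below is a hedged illustration: `parisi_matrix` and `power_sum_formula` are hypothetical helper names, and the block sizes and $q_j$ values are arbitrary.

```python
import numpy as np

def parisi_matrix(n, ms, qs):
    """K-RSB matrix. ms = [m_1, ..., m_K] (nested divisors of n), qs = [q_0, ..., q_K]."""
    Q = np.full((n, n), float(qs[0]))
    for m, q in zip(ms, qs[1:]):
        # Overwrite the diagonal blocks of size m with the next overlap value.
        for start in range(0, n, m):
            Q[start:start + m, start:start + m] = q
    np.fill_diagonal(Q, 0.0)
    return Q

def power_sum_formula(n, ms, qs, l):
    """Right hand side of (3.37) with m_0 = n and m_{K+1} = 1."""
    m = [n] + list(ms) + [1]
    return n * sum((m[j] - m[j + 1]) * qs[j] ** l for j in range(len(qs)))

n, ms, qs = 24, [12, 4, 2], [0.1, 0.3, 0.5, 0.8]   # a 3-step example
Q = parisi_matrix(n, ms, qs)
checks = {l: (np.sum(Q ** l), power_sum_formula(n, ms, qs, l)) for l in (1, 2, 3, 4)}
```

Each entry of `checks` pairs the direct sum over $\alpha \ne \beta$ with the closed form; the two agree for every power $l$ tried.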


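The internal energy for $J_0 = 0$, $h = 0$ is given by differentiation of the
free energy (2.17) by $\beta$ as$^2$

$$E = -\frac{\beta J^2}{2}\left(1 + \frac{1}{n}\sum_{\alpha\ne\beta} q_{\alpha\beta}^2\right) \to -\frac{\beta J^2}{2}\left(1 - \int_0^1 q^2(x)\,dx\right). \tag{3.39}$$

The magnetic susceptibility can be written down from the second derivative of
(2.17) by $h$ as

$$\chi = \beta\left(1 + \frac{1}{n}\sum_{\alpha\ne\beta} q_{\alpha\beta}\right) \to \beta\left(1 - \int_0^1 q(x)\,dx\right). \tag{3.40}$$

It needs some calculations to derive the free energy in the full RSB scheme.
Details are given in Appendix B. The final expression of the free energy
(2.17) is

$$\beta f = -\frac{\beta^2 J^2}{4}\left\{1 + \int_0^1 q(x)^2\,dx - 2q(1)\right\} - \int Du\, f_0\bigl(0, \sqrt{q(0)}\,u\bigr). \tag{3.41}$$

Here $f_0$ satisfies the Parisi equation

$$\frac{\partial f_0(x,h)}{\partial x} = -\frac{J^2}{2}\frac{dq}{dx}\left\{\frac{\partial^2 f_0}{\partial h^2} + x\left(\frac{\partial f_0}{\partial h}\right)^2\right\} \tag{3.42}$$

to be solved under the initial condition $f_0(1,h) = \log 2\cosh\beta h$.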


3.3.2 Order parameter near the critical point 
It is in general very difficult to find a solution to the extremization condition 
of the free energy (3.41) with respect to the order function q(x). It is neverthe- 
less possible to derive some explicit results by the Landau expansion when the 
temperature is close to the critical point and consequently q(x) is small. Let us 
briefly explain the essence of this procedure. 

When $J_0 = h = 0$, the expansion of the free energy (2.17) to fourth order in
$q_{\alpha\beta}$ turns out to be

$$\beta f = \lim_{n\to 0}\frac{1}{n}\left\{\frac{1}{4}\left(\frac{T^2}{T_f^2} - 1\right){\rm Tr}\,Q^2 - \frac{1}{6}{\rm Tr}\,Q^3 - \frac{1}{8}{\rm Tr}\,Q^4 + \frac{1}{4}\sum_{\alpha\ne\beta\ne\gamma} Q_{\alpha\beta}^2 Q_{\alpha\gamma}^2 - \frac{1}{12}\sum_{\alpha\ne\beta} Q_{\alpha\beta}^4\right\}, \tag{3.43}$$

where we have dropped $q$-independent terms. The operator Tr here denotes the
diagonal sum in the replica space. We have introduced the notation
$Q_{\alpha\beta} = (\beta J)^2 q_{\alpha\beta}$. Only the last term is relevant
to the RSB. It can indeed be verified that the eigenvalue of the replicon mode
that determines stability of the RS solution is, by setting the coefficient of
the last term to $-y$ (which is actually $-1/12$),

$$\lambda_3 = -16 y \theta^2, \tag{3.44}$$

where $\theta = (T_f - T)/T_f$. We may thus neglect all fourth-order terms
except $Q_{\alpha\beta}^4$ to discuss the essential features of the RSB and let
$n \to 0$ to get

$$\beta f = \frac{1}{2}\int_0^1 dx \left\{ |\theta|\, q^2(x) - \frac{1}{3}\, x\, q^3(x) - q(x)\int_0^x q^2(y)\,dy + \frac{1}{6}\, q^4(x) \right\}. \tag{3.45}$$

$^2$ The symbol of configurational average $[\cdots]$ will be omitted in the
present chapter as long as it does not lead to confusion.


The extremization condition with respect to $q(x)$ is written explicitly as

$$2|\theta|\, q(x) - x\, q^2(x) - \int_0^x q^2(y)\,dy - 2q(x)\int_x^1 q(y)\,dy + \frac{2}{3}\, q^3(x) = 0. \tag{3.46}$$

Differentiation of this formula gives

$$|\theta| - x\, q(x) - \int_x^1 q(y)\,dy + q^2(x) = 0 \quad {\rm or} \quad q'(x) = 0. \tag{3.47}$$

Still further differentiation leads to

$$q(x) = \frac{x}{2} \quad {\rm or} \quad q'(x) = 0. \tag{3.48}$$

The RS solution corresponds to a constant $q(x)$. This constant is equal to
$|\theta|$ according to (3.46). There also exists an $x$-dependent solution

$$q(x) = \frac{x}{2} \quad (0 \le x \le x_1 = 2q(1)) \tag{3.49}$$

$$q(x) = q(1) \quad (x_1 \le x \le 1). \tag{3.50}$$


By inserting this solution in the variational condition (3.46), we obtain

$$q(1) = |\theta| + O(\theta^2). \tag{3.51}$$

Figure 3.2 shows the resulting behaviour of $q(x)$ near the critical point
where $\theta$ is close to zero.
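The piecewise solution (3.49)-(3.50) can be substituted numerically into the extremization condition, read here as $2|\theta| q - x q^2 - \int_0^x q^2 dy - 2q\int_x^1 q\,dy + \frac{2}{3}q^3 = 0$. The sketch below is an illustration under assumptions: the plateau value is taken as the root of $q(1) - q(1)^2 = |\theta|$, which is what this truncated condition gives when the ramp $q = x/2$ is inserted and which agrees with (3.51) to leading order. Function names and the grid size are arbitrary.

```python
import numpy as np

def landau_q(x, theta):
    """Ramp q = x/2 up to x1 = 2 q1, then plateau q1 with q1 - q1^2 = theta (assumed)."""
    q1 = 0.5 * (1.0 - np.sqrt(1.0 - 4.0 * theta))
    return np.minimum(0.5 * x, q1), q1

def cumulative_trapezoid(y, x):
    """Running integral of y from x[0] to each x[k] by the trapezoidal rule."""
    out = np.zeros_like(y)
    out[1:] = np.cumsum(0.5 * (y[1:] + y[:-1]) * np.diff(x))
    return out

def residual_346(theta, n_grid=20001):
    """Left hand side of the extremization condition on a grid of x in [0, 1]."""
    x = np.linspace(0.0, 1.0, n_grid)
    q, q1 = landau_q(x, theta)
    int_q2_0_to_x = cumulative_trapezoid(q ** 2, x)
    int_q = cumulative_trapezoid(q, x)
    int_q_x_to_1 = int_q[-1] - int_q
    res = (2.0 * theta * q - x * q ** 2 - int_q2_0_to_x
           - 2.0 * q * int_q_x_to_1 + (2.0 / 3.0) * q ** 3)
    return x, q, q1, res

x, q, q1, res = residual_346(theta=0.05)
max_residual = np.max(np.abs(res))
```

The residual vanishes on the whole interval up to discretization error, and `q1` differs from $|\theta|$ only at order $\theta^2$.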

3.3.3 Vertical phase boundary

The susceptibility is a constant $\chi = 1/J$ near the critical point $T_f$
because the integral in (3.40) equals $\theta = 1 - T/T_f$ to leading order for
the solution (3.49)-(3.51). It turns out that this result remains valid not
only near $T_f$ but over the whole temperature range below the critical point.
We use this fact to show that the phase boundary between the spin glass and
ferromagnetic phases is a vertical line at $J_0 = J$ as in Fig. 2.1
(Toulouse 1980).
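The statement $\chi = 1/J$ near $T_f$ follows in a few lines from the ramp-and-plateau form of $q(x)$, namely $q(x) = x/2$ for $x \le x_1 = 2q(1)$ and $q(x) = q(1)$ above, together with $\chi = \beta(1 - \int_0^1 q(x)\,dx)$. The following worked step is a sketch valid to leading order in $\theta$:

```latex
\int_0^1 q(x)\,dx
  = \int_0^{x_1} \frac{x}{2}\,dx + q(1)\,(1 - x_1)
  = q(1)^2 + q(1) - 2q(1)^2
  = q(1) - q(1)^2
  \simeq \theta ,
\qquad
\chi = \beta\,(1 - \theta) = \frac{1}{T}\cdot\frac{T}{T_f} = \frac{1}{T_f} = \frac{1}{J}.
```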

Fig. 3.2. $q(x)$ near the critical point



The Hamiltonian of the SK model (2.7) suggests that a change of the centre
of distribution of $J_{ij}$ from 0 to $J_0/N$ shifts the energy per spin by
$-J_0 m^2/2$. Thus the free energy $f(T, m, J_0)$ as a function of $T$ and $m$
satisfies

$$f(T, m, J_0) = f(T, m, 0) - \frac{1}{2} J_0 m^2. \tag{3.52}$$

From the thermodynamic relation

$$\frac{\partial f(T, m, 0)}{\partial m} = h \tag{3.53}$$

and the fact that $m = 0$ when $J_0 = 0$ and $h = 0$, we obtain

$$\chi^{-1} = \left(\frac{\partial m}{\partial h}\right)^{-1}_{h\to 0} = \left.\frac{\partial^2 f(T, m, 0)}{\partial m^2}\right|_{m\to 0}. \tag{3.54}$$

Thus, for sufficiently small $m$, we have

$$f(T, m, 0) = f_0(T) + \frac{1}{2}\chi^{-1} m^2. \tag{3.55}$$

Combining (3.52) and (3.55) gives

$$f(T, m, J_0) = f_0(T) + \frac{1}{2}\left(\chi^{-1} - J_0\right) m^2. \tag{3.56}$$

This formula shows that the coefficient of $m^2$ in $f(T, m, J_0)$ vanishes
when $\chi = 1/J_0$ and therefore there is a phase transition between the
ferromagnetic and non-ferromagnetic phases according to the Landau theory.
Since $\chi = 1/J$ in the whole range $T < T_f$, we conclude that the boundary
between the ferromagnetic and spin glass phases exists at $J_0 = J$.


Stability analysis of the Parisi solution has revealed that the eigenvalue of
the replicon mode is zero, implying marginal stability of the Parisi RSB
solution. No other solutions have been found with a non-negative replicon
eigenvalue of the Hessian.

Fig. 3.3. Simple free energy (a) and multivalley structure (b)



3.4 Physical significance of RSB 


The RSB of the Parisi type has been introduced as a mathematical tool to 
resolve controversies in the RS solution. It has, however, been discovered that 
the solution has a profound physical significance. The main results are sketched 
here (Mézard et al. 1987; Binder and Young 1986). 


3.4.1 Multivalley structure 


In a ferromagnet the free energy as a function of the state of the system has 
a simple structure as depicted in Fig. 3.3(a). The free energy of the spin glass 
state, on the other hand, is considered to have many minima as in Fig. 3.3(b), 
and the barriers between them are expected to grow indefinitely as the system 
size increases. It is possible to give a clear interpretation of the RSB solution if 
we accept this physical picture. 

Suppose that the system size is large but not infinite. Then the system is 
trapped in the valley around one of the minima of the free energy for quite 
a long time. However, after a very long time, the system climbs the barriers 
and reaches all valleys eventually. Hence, within some limited time scale, the 
physical properties of a system are determined by one of the valleys. But, after 
an extremely long time, one would observe behaviour reflecting the properties 
of all the valleys. This latter situation is the one assumed in the conventional 
formulation of equilibrium statistical mechanics.

We now label free energy valleys by the index $a$ and write
$m_i^a = \langle S_i \rangle_a$ for the magnetization calculated by restricting
the system to a specific valley $a$. This is analogous to the restriction of
states to those with $m > 0$ (neglecting $m < 0$) in a simple ferromagnet.


3.4.2 $q_{\rm EA}$ and $\bar{q}$

To understand the spin ordering in a single valley, it is necessary to take the
thermodynamic limit to separate the valley from the others by increasing the
barriers indefinitely. Then we may ignore transitions between valleys and
observe the long-time behaviour of the system in a valley. It therefore makes
sense to define the order parameter $q_{\rm EA}$ for a single valley as

$$q_{\rm EA} = \lim_{t\to\infty}\lim_{N\to\infty}\left[\langle S_i(t_0) S_i(t_0 + t)\rangle\right]. \tag{3.57}$$

This quantity measures the similarity (or overlap) of a spin state at site $i$
after a long time to the initial condition at $t_0$. The physical significance
of this quantity suggests its equivalence to the average of the squared local
magnetization $(m_i^a)^2$:

$$q_{\rm EA} = \left[\sum_a P_a (m_i^a)^2\right] = \left[\sum_a P_a \frac{1}{N}\sum_i (m_i^a)^2\right]. \tag{3.58}$$

Here $P_a$ is the probability that the system is located in a valley (a pure
state) $a$, that is $P_a = e^{-\beta F_a}/Z$. In the second equality of (3.58),
we have assumed that the averaged squared local magnetization does not depend
on the location.

We may also define another order parameter $\bar{q}$ that represents the
average over all valleys corresponding to the long-time observation (the usual
statistical-mechanical average). This order parameter can be expressed
explicitly as

$$\bar{q} = \left[\Bigl(\sum_a P_a m_i^a\Bigr)^2\right] = \left[\sum_{ab} P_a P_b\, m_i^a m_i^b\right] = \left[\frac{1}{N}\sum_{ab} P_a P_b \sum_i m_i^a m_i^b\right], \tag{3.59}$$

which is rewritten using $m_i = \sum_a P_a m_i^a$ as

$$\bar{q} = [m_i^2] = [\langle S_i\rangle^2]. \tag{3.60}$$

As one can see from (3.59), $\bar{q}$ is the average with overlaps between
valleys taken into account and is an appropriate quantity for time scales
longer than transition times between valleys.

If there exists only a single valley (and its totally reflected state), the
relation $q_{\rm EA} = \bar{q}$ should hold, but in general we have
$q_{\rm EA} > \bar{q}$. The difference of these two order parameters
$q_{\rm EA} - \bar{q}$ is a measure of the existence of a multivalley
structure. We generally expect a continuous spectrum of order parameters
between $\bar{q}$ and $q_{\rm EA}$ corresponding to the variety of degrees of
transitions between valleys. This would correspond to the continuous function
$q(x)$ of the Parisi RSB solution.


3.4.3 Distribution of overlaps 
Similarity between two valleys $a$ and $b$ is measured by the overlap $q_{ab}$
defined by

$$q_{ab} = \frac{1}{N}\sum_i m_i^a m_i^b. \tag{3.61}$$

This $q_{ab}$ takes its maximum when the two valleys $a$ and $b$ coincide and
is zero when they are completely uncorrelated. Let us define the distribution
of $q_{ab}$ for a given random interaction $J$ as

$$P_J(q) = \langle \delta(q - q_{ab}) \rangle = \sum_{ab} P_a P_b\, \delta(q - q_{ab}), \tag{3.62}$$

and write $P(q)$ for the configurational average of $P_J(q)$:

$$P(q) = [P_J(q)]. \tag{3.63}$$

Fig. 3.4. Distribution function $P(q)$ of a simple system (a) and that of a
system with multivalley structure (b)



In a simple system like a ferromagnet, there are only two different valleys
connected by overall spin reversal and $q_{ab}$ assumes only $\pm m^2$. Then
$P(q)$ is constituted only by two delta functions at $q = \pm m^2$,
Fig. 3.4(a). If there is a multivalley structure with continuously different
states, on the other hand, $q_{ab}$ assumes various values and $P(q)$ has a
continuous part as in Fig. 3.4(b).
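For a single sample, the overlaps $q_{ab} = \frac{1}{N}\sum_i m_i^a m_i^b$ and the weights $P_a P_b$ entering $P_J(q)$ are straightforward to tabulate once the valley magnetizations are given. The sketch below uses the two-valley ferromagnet as the test case; the function name, the array layout, and the numerical values are illustrative assumptions.

```python
import numpy as np

def overlap_distribution(M, P):
    """M[a, i] = m_i^a, P[a] = weight of valley a.

    Returns the matrix of overlaps q_ab and the matrix of weights P_a P_b,
    i.e. the locations and strengths of the delta functions in P_J(q).
    """
    M = np.asarray(M, dtype=float)
    P = np.asarray(P, dtype=float)
    q_ab = M @ M.T / M.shape[1]
    w_ab = np.outer(P, P)
    return q_ab, w_ab

# Ferromagnet-like example: two valleys related by overall spin reversal.
N, m = 100, 0.8
M = np.vstack([m * np.ones(N), -m * np.ones(N)])
P = np.array([0.5, 0.5])

q_ab, w_ab = overlap_distribution(M, P)
q_bar = np.sum(w_ab * q_ab)              # average over all pairs of valleys
q_EA = np.sum(P * np.diag(q_ab))         # self-overlap averaged over valleys
weight_at_plus = w_ab[np.isclose(q_ab, m ** 2)].sum()
weight_at_minus = w_ab[np.isclose(q_ab, -m ** 2)].sum()
```

The two delta functions at $q = \pm m^2$ each carry weight $1/2$, and the single-valley overlap `q_EA` exceeds the all-valley average `q_bar`, as expected when more than one valley contributes.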


3.4.4 Replica representation of the order parameter 


Let us further investigate the relationship between the RSB and the continuous
part of the distribution function $P(q)$. The quantity $q_{\alpha\beta}$ in the
replica formalism is the overlap between two replicas $\alpha$ and $\beta$ at a
specific site

$$q_{\alpha\beta} = \langle S_i^\alpha S_i^\beta \rangle. \tag{3.64}$$

In the RSB this quantity has different values from one pair of replicas
$\alpha\beta$ to another pair. The genuine statistical-mechanical average
should be the mean of all possible values of $q_{\alpha\beta}$ and is
identified with $\bar{q}$ defined in (3.59),

$$\bar{q} = \lim_{n\to 0}\frac{1}{n(n-1)}\sum_{\alpha\ne\beta} q_{\alpha\beta}. \tag{3.65}$$

The spin glass order parameter for a single valley, on the other hand, does not
reflect the difference between valleys caused by transitions between them and
therefore is expected to be larger than any other possible values of the order
parameter. We may then identify $q_{\rm EA}$ with the largest value of
$q_{\alpha\beta}$ in the replica method:

$$q_{\rm EA} = \max_{(\alpha\beta)} q_{\alpha\beta} = \max_x q(x). \tag{3.66}$$

Let us define $x(q)$ as the accumulated distribution of $P(q)$:

$$x(q) = \int_{-1}^{q} dq'\, P(q'), \qquad \frac{dx}{dq} = P(q). \tag{3.67}$$

Using this definition and the fact that the statistical-mechanical average is
the mean over all possible values of $q$, we may write

$$\bar{q} = \int_{-1}^{1} dq'\, q' P(q') = \int_0^1 q(x)\, dx. \tag{3.68}$$

The two parameters $q_{\rm EA}$ and $\bar{q}$ have thus been expressed by
$q(x)$. If there are many valleys, $q_{\alpha\beta}$ takes various values, and
$P(q)$ cannot be expressed simply in terms of two delta functions. The order
function $q(x)$ under such a circumstance has a non-trivial structure as one
can see from (3.67), which corresponds to the RSB of Parisi type. The
functional form of $q(x)$ mentioned in §3.3.2 reflects the multivalley
structure of the space of states of the spin glass phase.
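As a worked example, consider the ramp-and-plateau order function of §3.3.2, $q(x) = x/2$ for $0 \le x \le x_1 = 2q(1)$ and $q(x) = q(1)$ for $x_1 \le x \le 1$. Inverting $q(x)$ to obtain $x(q)$ and differentiating gives a flat continuous part plus a delta function at the largest overlap. The following is a sketch under that assumed form of $q(x)$:

```latex
x(q) = 2q \quad (0 \le q < q(1)), \qquad x(q) = 1 \quad (q \ge q(1)),
\\[4pt]
P(q) = \frac{dx}{dq} = 2 \quad (0 \le q < q(1)) \;\; + \;\; \bigl(1 - 2q(1)\bigr)\,\delta\bigl(q - q(1)\bigr),
\\[4pt]
\bar{q} = \int dq\; q\,P(q) = q(1)^2 + \bigl(1 - 2q(1)\bigr)\,q(1) = q(1) - q(1)^2,
\qquad q_{\rm EA} = q(1).
```

The continuous part has total weight $2q(1)$ and the delta function carries the remaining $1 - 2q(1)$, so $P(q)$ is normalized and $q_{\rm EA} > \bar{q}$ whenever $q(1) > 0$.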


3.4.5 Ultrametricity 


The Parisi RSB solution shows a remarkable feature of ultrametricity. The
configurational average of the distribution function between three different
states

$$P_J(q_1, q_2, q_3) = \sum_{abc} P_a P_b P_c\, \delta(q_1 - q_{ab})\, \delta(q_2 - q_{bc})\, \delta(q_3 - q_{ca}) \tag{3.69}$$

can be evaluated by the RSB method to yield

$$[P_J(q_1, q_2, q_3)] = \frac{1}{2} P(q_1)\, x(q_1)\, \delta(q_1 - q_2)\, \delta(q_1 - q_3) + \frac{1}{2}\left\{ P(q_1) P(q_2)\, \Theta(q_1 - q_2)\, \delta(q_2 - q_3) + (\text{two terms with 1, 2, 3 permuted}) \right\}.$$

Here $x(q)$ has been defined in (3.67), and $\Theta(q_1 - q_2)$ is the step
function equal to 1 for $q_1 > q_2$ and 0 for $q_1 < q_2$. The first term on
the right hand side is non-vanishing only if the three overlaps are equal to
each other, and the second term requires that the overlaps be the edges of an
isosceles triangle ($q_1 > q_2$, $q_2 = q_3$). This means that the distances
between three states should form either an equilateral or an isosceles
triangle. We may interpret this result as a tree-like (or equivalently, nested)
structure of the space of states as in Fig. 3.5. A metric space where the
distances between three points satisfy this condition is called an
ultrametric space.
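In terms of overlaps, the triangle condition says that for every triple of states the two smallest overlaps coincide (a larger overlap meaning a shorter distance). This can be tested directly on an overlap matrix. The sketch below is illustrative only; `is_ultrametric` and the two small example matrices are hypothetical.

```python
from itertools import combinations
import numpy as np

def is_ultrametric(q, tol=1e-9):
    """True if, for every triple of states, the two smallest overlaps coincide."""
    q = np.asarray(q, dtype=float)
    for a, b, c in combinations(range(q.shape[0]), 3):
        lo, mid, _ = sorted((q[a, b], q[b, c], q[c, a]))
        if abs(mid - lo) > tol:
            return False
    return True

# Tree-like example: states {0, 1} and {2, 3} form two clusters.
tree = np.array([[1.0, 0.7, 0.3, 0.3],
                 [0.7, 1.0, 0.3, 0.3],
                 [0.3, 0.3, 1.0, 0.7],
                 [0.3, 0.3, 0.7, 1.0]])

# Counter-example: three mutually different overlaps (a scalene triangle).
scalene = np.array([[1.0, 0.7, 0.3],
                    [0.7, 1.0, 0.5],
                    [0.3, 0.5, 1.0]])

tree_ok = is_ultrametric(tree)
scalene_ok = is_ultrametric(scalene)
```

The clustered matrix passes the test and the scalene one fails, which is the discrete counterpart of the delta-function structure in the three-state distribution.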


3.5 TAP equation 

A different point of view on spin glasses is provided by the equation of state due to 
Thouless, Anderson, and Palmer (TAP) which concerns the local magnetization 
in spin glasses (Thouless et al. 1977). 

Fig. 3.5. Tree-like and nested structures in an ultrametric space. The distance
between C and D is equal to that between C and E and to that between D
and E, which is smaller than that between A and C and that between C and
F.


3.5.1 TAP equation 


The local magnetization of the SK model satisfies the following TAP equation,
given the random interactions $J = \{J_{ij}\}$:

$$m_i = \tanh\beta\left\{\sum_j J_{ij} m_j + h_i - \beta\sum_j J_{ij}^2 (1 - m_j^2)\, m_i\right\}. \tag{3.70}$$

The first term on the right hand side represents the usual internal field, a
generalization of (1.19). The third term is called the reaction field of
Onsager and is added to remove the effects of self-response in the following
sense. The magnetization $m_i$ affects site $j$ through the internal field
$J_{ij} m_i$ that changes the magnetization of site $j$ by the amount
$\chi_{jj} J_{ij} m_i$. Here

$$\chi_{jj} = \left.\frac{\partial m_j}{\partial h_j}\right|_{h_j \to 0} = \beta (1 - m_j^2). \tag{3.71}$$

Then the internal field at site $i$ would increase by

$$J_{ij}\, \chi_{jj}\, J_{ij}\, m_i = \beta J_{ij}^2 (1 - m_j^2)\, m_i. \tag{3.72}$$

The internal field at site $i$ should not include such a rebound of itself. The
third term on the right hand side of (3.70) removes this effect. In a usual
ferromagnet with infinite-range interactions, the interaction scales as
$J_{ij} = J/N$ and the third term is negligible since it is of $O(1/N)$. In the
SK model, however, we have $J_{ij}^2 = O(1/N)$ and the third term is of the
same order as the first and second terms and cannot be neglected. The TAP
equation gives a basis to treat the spin glass problem without taking the
configurational average over the distribution of $J$.
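The TAP equation can be iterated numerically for a single sample of couplings. The sketch below is a minimal illustration rather than a recommended solver: it draws symmetric Gaussian couplings of variance $J^2/N$, uses a damped fixed-point iteration (the damping factor, system size, field, and temperature are arbitrary assumptions), and stays well above $T_f$ where plain iteration is expected to converge.

```python
import numpy as np

def sk_couplings(N, J=1.0, seed=0):
    """Symmetric Gaussian couplings with variance J^2/N and zero diagonal."""
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, J / np.sqrt(N), size=(N, N))
    upper = np.triu(A, 1)
    return upper + upper.T

def tap_rhs(m, Jmat, h, beta):
    """Right hand side of the TAP equation, including the Onsager reaction term."""
    onsager = beta * (Jmat ** 2 @ (1.0 - m ** 2)) * m
    return np.tanh(beta * (Jmat @ m + h - onsager))

def solve_tap(Jmat, h, beta, damping=0.5, n_iter=500):
    """Damped fixed-point iteration m <- (1 - d) m + d F(m), started from m = 0."""
    m = np.zeros(Jmat.shape[0])
    for _ in range(n_iter):
        m = (1.0 - damping) * m + damping * tap_rhs(m, Jmat, h, beta)
    return m

N = 200
Jmat = sk_couplings(N)
h = 0.1 * np.ones(N)          # small uniform field, illustrative
beta = 1.0 / 3.0              # T = 3J, well inside the paramagnetic phase
m = solve_tap(Jmat, h, beta)
residual = np.max(np.abs(m - tap_rhs(m, Jmat, h, beta)))
```

At low temperatures such a naive iteration is not expected to behave well, since many solutions coexist and lie close to the border of stability; the example is meant only to show the structure of the three terms.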

The TAP equation (3.70) corresponds to the extremization condition of the
following free energy:

$$f_{\rm TAP} = -\frac{1}{2}\sum_{i\ne j} J_{ij} m_i m_j - \sum_i h_i m_i - \frac{\beta}{4}\sum_{i\ne j} J_{ij}^2 (1 - m_i^2)(1 - m_j^2) + T\sum_i\left\{\frac{1 + m_i}{2}\log\frac{1 + m_i}{2} + \frac{1 - m_i}{2}\log\frac{1 - m_i}{2}\right\}. \tag{3.73}$$

The first and second terms on the right hand side are for the internal energy,
and the final term denotes the entropy. The third term corresponds to the
reaction field.

This free energy can be derived from an expansion of the free energy with
magnetization specified

$$-\beta f(\alpha, \beta, m) = \log {\rm Tr}\, e^{-\beta H(\alpha)} - \beta\sum_i h_i m_i, \tag{3.74}$$

where $H(\alpha) = \alpha H_0 - \sum_i h_i S_i$ and $m = \{m_i\}$. The
Hamiltonian $H_0$ denotes the usual SK model, and $h_i(\alpha, \beta, m)$ is
the Lagrange multiplier to enforce the constraint
$m_i = \langle S_i \rangle_\alpha$, where $\langle \cdot \rangle_\alpha$ is the
thermal average with $H(\alpha)$. Expanding $f$ to second order in $\alpha$
around $\alpha = 0$ and setting $\alpha$ equal to one, we can derive (3.73).
This is called the Plefka expansion (Plefka 1982).
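To see it, let us first carry out the differentiations

$$\frac{\partial f}{\partial \alpha} = \langle H_0 \rangle_\alpha \tag{3.75}$$

$$\frac{\partial^2 f}{\partial \alpha^2} = -\beta\left\langle H_0\left(H_0 - \langle H_0\rangle_\alpha - \sum_i \frac{\partial h_i}{\partial \alpha}(S_i - m_i)\right)\right\rangle_\alpha. \tag{3.76}$$

The first two terms of the expansion $f(1) \approx f(0) + f'(0)$ give (3.73)
except the Onsager term with $J_{ij}^2$, since

$$f(0) = T\sum_i\left(\frac{1 + m_i}{2}\log\frac{1 + m_i}{2} + \frac{1 - m_i}{2}\log\frac{1 - m_i}{2}\right) \tag{3.77}$$

$$f'(0) = -\frac{1}{2}\sum_{i\ne j} J_{ij} m_i m_j. \tag{3.78}$$

The second derivative (3.76) can be evaluated at $\alpha = 0$ from the relation

$$\left.\frac{\partial h_i}{\partial \alpha}\right|_{\alpha = 0} = \left.\frac{\partial}{\partial \alpha}\frac{\partial f}{\partial m_i}\right|_{\alpha = 0} = -\sum_{j(\ne i)} J_{ij} m_j. \tag{3.79}$$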

Inserting this relation into (3.76) we find

$$f''(0) = -\frac{\beta}{2}\sum_{i\ne j} J_{ij}^2 (1 - m_i^2)(1 - m_j^2), \tag{3.80}$$

which gives the Onsager term in (3.73). It has hence been shown that
$f_{\rm TAP} = f(0) + f'(0) + f''(0)/2$. It can also be shown that the
convergence condition of the above expansion for $\alpha > 1$ is equivalent to
the stability condition of the free energy that the eigenvalues of the Hessian
$\{\partial^2 f_{\rm TAP}/\partial m_i \partial m_j\}$ be non-negative. All
higher order terms in the expansion vanish in the thermodynamic limit as long
as the stability condition is satisfied.


3.5.2 Cavity method 


The cavity method is useful to derive the TAP equation from a different
perspective (Mézard et al. 1986, 1987). It also attracts attention in relation
to practical algorithms to solve information processing problems (Opper and
Saad 2001). The argument in the present section is restricted to the case of
the SK model for simplicity (Opper and Winther 2001).

Let us consider the local magnetization $m_i = \langle S_i \rangle$, where the
thermal average is taken within a single valley. The goal is to show that this
local magnetization satisfies the TAP equation. For this purpose it suffices to
derive the distribution function of local spin $P_i(S_i)$, with which the above
thermal average is carried out. It is assumed for simplicity that there is no
external field, so that from here on $h_i$ denotes the local field
$h_i = \sum_j J_{ij} S_j$, which determines the local magnetization. Hence the
joint distribution of $S_i$ and $h_i$ can be written as

$$P(S_i, h_i) \propto e^{\beta h_i S_i}\, P(h_i \setminus S_i), \tag{3.81}$$

where $P(h_i \setminus S_i)$ is the distribution of the local field when the
spin $S_i$ is removed from the system (i.e. when we set $J_{ij} = 0$ for all
$j$) (cavity field). More explicitly,

$$P(h_i \setminus S_i) = {\rm Tr}_{S \setminus S_i}\, \delta\Bigl(h_i - \sum_j J_{ij} S_j\Bigr)\, P(S \setminus S_i), \tag{3.82}$$

where $P(S \setminus S_i)$ is the probability distribution of the whole system
without $S_i$ ($J_{ij} = 0$ for all $j$). The distribution

$$P_i(S_i) \propto \int dh_i\, e^{\beta h_i S_i}\, P(h_i \setminus S_i) \tag{3.83}$$

is thus determined once $P(h_i \setminus S_i)$ is known.

In the SK model the range of interaction is unlimited and the number of
terms appearing in the sum $\sum_j J_{ij} S_j$ is $N - 1$. If all these terms
are independent and identically distributed, then the central limit theorem
assures that the cavity field $h_i$ is Gaussian distributed. This is certainly
the case on the Bethe lattice (Fig. 3.6) that breaks up into independent trees
as soon as a site is removed. Let us assume that this is also true for the SK
model in which the correlations of different sites are weak. We then have

$$P(h_i \setminus S_i) = \frac{1}{\sqrt{2\pi V_i}}\exp\left\{-\frac{(h_i - \langle h_i\rangle_{\setminus i})^2}{2V_i}\right\}. \tag{3.84}$$

Combination of this Gaussian form and (3.83) yields

$$m_i = \tanh\beta\langle h_i\rangle_{\setminus i}. \tag{3.85}$$

Fig. 3.6. There are no loops of bonds on the Bethe lattice, so that removal of
any single site breaks the system into independent trees.



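It is therefore necessary to evaluate the average
$\langle h_i\rangle_{\setminus i}$. The standard average of the local field
(without the cavity),

$$\langle h_i \rangle = {\rm Tr}_{S_i}\int dh_i\, h_i\, P(S_i, h_i), \tag{3.86}$$

together with the Gaussian cavity field (3.84) leads to the relation

$$\langle h_i \rangle = \langle h_i\rangle_{\setminus i} + \beta V_i \langle S_i \rangle \tag{3.87}$$

or, in terms of $m_i$,

$$\langle h_i\rangle_{\setminus i} = \sum_j J_{ij} m_j - \beta V_i m_i. \tag{3.88}$$

We then have to evaluate the variance of the local field

$$V_i = \sum_{j,k} J_{ij} J_{ik}\left(\langle S_j S_k\rangle_{\setminus i} - \langle S_j\rangle_{\setminus i}\langle S_k\rangle_{\setminus i}\right). \tag{3.89}$$

Only the diagonal terms ($j = k$) survive in the above sum because of the
clustering property in a single valley

$$\frac{1}{N^2}\sum_{j,k}\left(\langle S_j S_k\rangle_{\setminus i} - \langle S_j\rangle_{\setminus i}\langle S_k\rangle_{\setminus i}\right)^2 \to 0 \tag{3.90}$$

as $N \to \infty$ as well as the independence of $J_{ij}$ and $J_{ik}$
($j \ne k$). We then have

$$V_i \approx \sum_j J_{ij}^2\left(1 - \langle S_j\rangle_{\setminus i}^2\right) \approx \sum_j J_{ij}^2\left(1 - \langle S_j\rangle^2\right) = \sum_j J_{ij}^2 (1 - m_j^2). \tag{3.91}$$

From this, (3.85), and (3.88), we arrive at the TAP equation (3.70).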

The equations of state within the RS ansatz (2.28) and (2.30) can be derived
from the TAP equation with the cavity method taken into account. We first
separate the interaction $J_{ij}$ into the ferromagnetic and random terms,

$$J_{ij} = \frac{J_0}{N} + \frac{J}{\sqrt{N}}\, z_{ij}, \tag{3.92}$$

where $z_{ij}$ is a Gaussian random variable with vanishing mean and unit
variance. Then (3.88) is

$$\langle h_i\rangle_{\setminus i} = \frac{J_0}{N}\sum_j m_j + \frac{J}{\sqrt{N}}\sum_j z_{ij} m_j - \beta V_i m_i. \tag{3.93}$$

The first term on the right hand side is identified with $J_0 m$, where $m$ is
the ferromagnetic order parameter. The effects of the third term, the cavity
correction, can be taken into account by treating only the second term under
the assumption that each term $z_{ij} m_j$ is an independent quenched random
variable; the whole expression of (3.93) is the thermal average of the cavity
field and therefore the contribution from one site $j$ would not interfere with
that from another site $j'$. Then the second term is Gaussian distributed
according to the central limit theorem. The mean vanishes and the variance of
the sum is

$$\sum_j\sum_k [z_{ij} z_{ik}]\, m_j m_k = \sum_j m_j^2 = Nq. \tag{3.94}$$

Thus the sum in the second term (with the third term taken into account) is
expressed as $\sqrt{Nq}\, z$ with $z$ being a Gaussian quenched random variable
with vanishing mean and variance unity, so that the second term itself is
$J\sqrt{q}\, z$. Hence, averaging (3.85) over the distribution of $z$, we find

$$m = \int Dz\, \tanh\beta\left(J_0 m + J\sqrt{q}\, z\right), \tag{3.95}$$

which is the RS equation of state (2.28). Averaging the square of (3.85) gives
(2.30). One should remember that the amount of information in the TAP equation
is larger than that of the RS equations of state because the former is a set
of equations for $N$ variables whereas the latter is for the macroscopic order
parameters $m$ and $q$. It is also possible to derive the results of RSB
calculations with more elaborate arguments (Mézard et al. 1986).
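The pair of RS equations obtained this way, $m = \int Dz \tanh\beta(J_0 m + J\sqrt{q}\,z)$ and the corresponding equation for $q$ with $\tanh^2$, can be solved by simple iteration. The sketch below is a hedged illustration: the quadrature order, starting values, and iteration count are arbitrary assumptions, and only the case $J_0 = 0$ is evaluated.

```python
import numpy as np

def solve_rs(beta, J=1.0, J0=0.0, m=0.5, q=0.5, n_iter=2000, n_nodes=80):
    """Fixed-point iteration of the RS equations of state for (m, q)."""
    z, w = np.polynomial.hermite_e.hermegauss(n_nodes)
    w = w / np.sqrt(2.0 * np.pi)          # weights of the Gaussian measure Dz
    for _ in range(n_iter):
        t = np.tanh(beta * (J0 * m + J * np.sqrt(q) * z))
        m, q = float(w @ t), float(w @ t ** 2)
    return m, q

# Below the transition (T = 0.5 J) and above it (T = 1.5 J), both with J0 = 0.
m_low, q_low = solve_rs(beta=2.0)
m_high, q_high = solve_rs(beta=1.0 / 1.5)
```

With $J_0 = 0$ the magnetization vanishes at both temperatures, while $q$ is finite below $T_f = J$ and decays to zero above it.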


3.5.3 Properties of the solution 


To investigate the behaviour of the solution of the TAP equation (3.70) around
the spin glass transition point, we assume that both the $m_i$ and $h_i$ are
small and expand the right hand side to first order to obtain

$$m_i = \beta\sum_j J_{ij} m_j + \beta h_i - \beta^2 J^2 m_i. \tag{3.96}$$

This linear equation (3.96) can be solved by the eigenvalues and eigenvectors
of the symmetric matrix $J$. Let us for this purpose expand $J_{ij}$ by its
eigenvectors as

$$J_{ij} = \sum_\lambda J_\lambda\, \langle i|\lambda\rangle\langle\lambda|j\rangle. \tag{3.97}$$

We define the $\lambda$-magnetization and $\lambda$-field by

$$m_\lambda = \sum_i \langle\lambda|i\rangle\, m_i, \qquad h_\lambda = \sum_i \langle\lambda|i\rangle\, h_i \tag{3.98}$$

and rewrite (3.96) as

$$m_\lambda = \beta J_\lambda m_\lambda + \beta h_\lambda - \beta^2 J^2 m_\lambda. \tag{3.99}$$

Thus the $\lambda$-susceptibility acquires the expression

$$\chi_\lambda = \frac{\partial m_\lambda}{\partial h_\lambda} = \frac{\beta}{1 - \beta J_\lambda + (\beta J)^2}. \tag{3.100}$$
The eigenvalues of the random matrix $J$ are known to be distributed between
$-2J$ and $2J$ with the density

$$\rho(J_\lambda) = \frac{\sqrt{4J^2 - J_\lambda^2}}{2\pi J^2}. \tag{3.101}$$

It is thus clear from (3.100) that the susceptibility corresponding to the
largest eigenvalue $J_\lambda = 2J$ diverges at $T_f = J$, implying a phase
transition. This transition point $T_f = J$ agrees with the replica result.

In a uniform ferromagnet, the uniform magnetization corresponding to the
conventional susceptibility (which diverges at the transition point) develops
below the transition point to form an ordered phase. Susceptibilities to all
other external fields (such as a field with random sign at each site) do not
diverge at any temperature. In the SK model, by contrast, there is a continuous
spectrum of $J_\lambda$ and therefore, according to (3.100), various modes
continue to diverge one after another below the transition point $T_f$ where
the mode with the largest eigenvalue shows a divergent susceptibility. In this
sense there exist continuous phase transitions below $T_f$. This fact
corresponds to the marginal stability of the Parisi solution with zero
eigenvalue of the Hessian and is characteristic of the spin glass phase of the
SK model.

The local magnetization $m_i^a$ which appeared in the argument of the
multivalley structure in the previous section is considered to be the solution
of the TAP equation that minimizes the free energy. Numerical analysis
indicates that solutions of the TAP equation at low temperatures lie on the
border of the stability condition, reminiscent of the marginal stability of the
Parisi solution (Nemoto and Takayama 1985). General solutions of the TAP
equation may, however, correspond to local minima, not the global minima of the
free energy (3.73). It is indeed expected that the solutions satisfying the
minimization condition of the free energy occupy only a fraction of the whole
set of solutions of the TAP equation that has very many solutions of
$O(e^{aN})$ ($a > 0$).
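The spectrum of the coupling matrix and the mode susceptibility $\chi_\lambda = \beta/(1 - \beta J_\lambda + (\beta J)^2)$ can be examined on a single random sample. The following is an illustrative sketch: the matrix size and seed are arbitrary, the helper names are hypothetical, and finite-size samples only approximate the band edge $2J$ and the semicircle density.

```python
import numpy as np

def sk_matrix(N, J=1.0, seed=0):
    """Symmetric Gaussian matrix with off-diagonal variance J^2/N and zero diagonal."""
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, J / np.sqrt(N), size=(N, N))
    upper = np.triu(A, 1)
    return upper + upper.T

def chi_lambda(J_lam, beta, J=1.0):
    """Susceptibility of the mode with eigenvalue J_lam from the linearized TAP equation."""
    return beta / (1.0 - beta * J_lam + (beta * J) ** 2)

N = 400
eigs = np.linalg.eigvalsh(sk_matrix(N))

# Fraction of eigenvalues in [-J, J] compared with the semicircle prediction,
# i.e. the integral of sqrt(4 - x^2) / (2 pi) over [-1, 1] for J = 1.
frac_inside = np.mean(np.abs(eigs) <= 1.0)
semicircle_fraction = (np.sqrt(3.0) + 2.0 * np.pi / 3.0) / (2.0 * np.pi)

# The denominator at the band edge J_lam = 2J is (1 - beta*J)^2.
chi_edge_T2 = chi_lambda(2.0, beta=0.5)      # T = 2J: finite
chi_edge_near = chi_lambda(2.0, beta=0.99)   # T close to J: very large
```

As $T$ approaches $J$ from above, the susceptibility of the edge mode grows as $\beta/(1 - \beta J)^2$, while modes deeper in the band remain finite there.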

Bibliographical note


Extensive accounts of the scheme of the RSB can be found in Mézard et al. 
(1987), Binder and Young (1986), and Fischer and Hertz (1991). All of these 
volumes and van Hemmen and Morgenstern (1987) cover most of the develop- 
ments related to the mean-field theory until the mid 1980s including dynamics 
and experiments not discussed in the present book. One of the major topics of 
high current activity in spin glass theory is slow dynamics. The goal is to clarify 
the mechanism of anomalous long relaxations in glassy systems at the mean-field 
level as well as in realistic finite-dimensional systems. Reviews on this problem 
are found in Young (1997) and Miyako et al. (2000). Another important issue 
is the existence and properties of the spin glass state in three (and other fi- 
nite) dimensions. The mean-field theory predicts a very complicated structure 
of the spin glass state as represented by the full RSB scheme for the SK model. 
Whether or not this picture applies to three-dimensional systems is not a trivial 
problem, and many theoretical and experimental investigations are still going 
on. The reader will find summaries of recent activities in the same volumes as 
above (Young 1997; Miyako et al. 2000). See also Dotsenko (2001) for the renor- 
malization group analyses of finite-dimensional systems. 


4 
GAUGE THEORY OF SPIN GLASSES 


We introduced the mean-field theory of spin glasses in the previous chapters 
and saw that a rich structure of the phase space emerges from the replica sym- 
metry breaking. The next important problem would be to study how reliable 
the predictions of mean-field theory are in realistic finite-dimensional systems. 
It is in general very difficult to investigate two- and three-dimensional systems 
by analytical methods, and current studies in this field are predominantly by
numerical methods. It is not the purpose of this book, however, to review the 
status of numerical calculations; we instead introduce a different type of argu- 
ment, the gauge theory, which uses the symmetry of the system to derive a 
number of rigorous/exact results. The gauge theory does not directly answer the 
problem of the existence of the spin glass phase in finite dimensions. Nevertheless 
it places strong constraints on the possible structure of the phase diagram. Also, 
the gauge theory will be found to be closely related to the Bayesian method 
frequently encountered in information processing problems to be discussed in 
subsequent chapters. 


4.1 Phase diagram of finite-dimensional systems 


The SK model may be regarded as the Edwards-Anderson model in the limit of
infinite spatial dimension. The phase diagram of the finite-dimensional
$\pm J$ Ising model (2.3) is expected to have a structure like Fig. 4.1. The
case of $p = 1$ is the pure ferromagnetic Ising model with a ferromagnetic
phase for $T < T_c$ and paramagnetic phase for $T > T_c$. As $p$ decreases,
antiferromagnetic interactions gradually destabilize the ferromagnetic phase,
resulting in a decreased transition temperature. The ferromagnetic phase
eventually disappears completely for $p$ below a threshold $p_c$. Numerical
evidence shows that the spin glass phase exists adjacent to the ferromagnetic
phase if the spatial dimensionality is three or larger. There might be a mixed
phase with the RSB at low temperatures within the ferromagnetic phase. The
Gaussian model (2.2) is expected to have an analogous phase diagram.

It is very difficult to determine the structure of this phase diagram accurately, 
and active investigations, mainly numerical, are still going on. The gauge theory 
does not give a direct answer to the existence problem of the spin glass phase 
in finite dimensions, but it provides a number of powerful tools to restrict
possibilities. It also gives the exact solution for the energy under certain conditions
(Nishimori 1980, 1981; Morita and Horiguchi 1980; Horiguchi 1981). 

Fig. 4.1. Phase diagram of the $\pm J$ model (horizontal axis $p$, running from
$p = 0$ through $p_c$ to $p = 1$)


4.2 Gauge transformation 


Let us consider the symmetry of the Edwards-Anderson model

$$
H = -\sum_{\langle ij \rangle} J_{ij} S_i S_j \tag{4.1}
$$

to show that a simple transformation of variables using the symmetry leads to a
number of non-trivial conclusions. We do not restrict the range of the sum
$\langle ij \rangle$ in (4.1) here; it could be over nearest neighbours or it may
include farther pairs. We first discuss the $\pm J$ model and add comments on the
Gaussian and other models later.

We define the gauge transformation of spins and interactions as follows:

$$
S_i \to S_i \sigma_i, \qquad J_{ij} \to J_{ij} \sigma_i \sigma_j. \tag{4.2}
$$

Here $\sigma_i$ is an Ising spin variable at site $i$ fixed to either 1 or $-1$
arbitrarily, independently of $S_i$. This transformation is performed at all
sites. Then the Hamiltonian (4.1) is transformed as

$$
H \to -\sum_{\langle ij \rangle} J_{ij} \sigma_i \sigma_j \cdot S_i \sigma_i \cdot S_j \sigma_j = H, \tag{4.3}
$$

which shows that the Hamiltonian is gauge invariant.
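
The invariance (4.3) can be checked numerically on a small cluster. The Python
sketch below is only an illustration: the function names `energy`,
`gauge_transform`, and `random_signs`, the bond dictionary layout, and the
six-site fully connected cluster are assumptions made for this example and do
not come from the text.

```python
import random

def energy(J, S):
    """H = -sum over bonds of J_ij S_i S_j, with bonds stored as {(i, j): J_ij}."""
    return -sum(Jij * S[i] * S[j] for (i, j), Jij in J.items())

def gauge_transform(J, S, sigma):
    """Apply S_i -> S_i sigma_i and J_ij -> J_ij sigma_i sigma_j at every site."""
    S_new = [s * g for s, g in zip(S, sigma)]
    J_new = {(i, j): Jij * sigma[i] * sigma[j] for (i, j), Jij in J.items()}
    return J_new, S_new

def random_signs(n, rng):
    """Return a list of n independent values drawn from {-1, +1}."""
    return [rng.choice([-1, 1]) for _ in range(n)]

rng = random.Random(0)
N = 6
bonds = [(i, j) for i in range(N) for j in range(i + 1, N)]  # every pair interacts
J = dict(zip(bonds, random_signs(len(bonds), rng)))          # a +-J bond sample (J = 1)
S = random_signs(N, rng)                                     # spin configuration
sigma = random_signs(N, rng)                                 # arbitrary gauge variables
J_new, S_new = gauge_transform(J, S, sigma)
print(energy(J, S), energy(J_new, S_new))                    # the two values coincide
```

Because every $\sigma_i$ appears squared in each bond term, the equality is exact
in integer arithmetic for any choice of bonds, spins, and gauge variables.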


To see how the probability distribution (2.3) of the $\pm J$ model changes under
the gauge transformation, it is convenient to rewrite the expression (2.3) as

$$
P(J_{ij}) = \frac{e^{K_p \tau_{ij}}}{2\cosh K_p}, \tag{4.4}
$$

where $K_p$ is a function of the probability $p$,

$$
e^{2K_p} = \frac{p}{1-p}. \tag{4.5}
$$

In (4.4), $\tau_{ij}$ is defined by $J_{ij} = J\tau_{ij}$, the sign of $J_{ij}$. It
is straightforward to check that (4.4) agrees with (2.3) by inserting
$\tau_{ij} = 1$ or $-1$ and using (4.5). The distribution function transforms by
the gauge transformation as

$$
P(J_{ij}) \to \frac{e^{K_p \tau_{ij} \sigma_i \sigma_j}}{2\cosh K_p}. \tag{4.6}
$$


Thus the distribution function is not gauge invariant. The Gaussian distribution
function transforms as

$$
P(J_{ij}) \to \frac{1}{\sqrt{2\pi J^2}}
\exp\left( -\frac{J_{ij}^2 + J_0^2}{2J^2} \right)
\exp\left( \frac{J_0}{J^2} J_{ij} \sigma_i \sigma_j \right). \tag{4.7}
$$

It should be noted here that the arguments developed below in the present
chapter apply to a more generic model with a distribution function of the form

$$
P(J_{ij}) = P_0(|J_{ij}|)\, e^{a J_{ij}}. \tag{4.8}
$$

The Gaussian model clearly has this form with $a = J_0/J^2$, and the same is true
for the $\pm J$ model since its distribution can be expressed as

$$
P(J_{ij}) = p\,\delta(\tau_{ij} - 1) + (1 - p)\,\delta(\tau_{ij} + 1)
= \frac{e^{K_p \tau_{ij}}}{2\cosh K_p} \{ \delta(\tau_{ij} - 1) + \delta(\tau_{ij} + 1) \}, \tag{4.9}
$$

for which we choose $a = K_p/J$. The probability distribution (4.8) transforms as

$$
P(J_{ij}) \to P_0(|J_{ij}|)\, e^{a J_{ij} \sigma_i \sigma_j}. \tag{4.10}
$$



4.3 Exact solution for the internal energy 


An appropriate application of gauge transformation allows us to calculate the
exact value of the internal energy of the Edwards-Anderson model under a
certain condition. We mainly explain the case of the $\pm J$ model and state only
the results for the Gaussian model. Other models in the class (4.8) can be
treated analogously.


4.3.1 Application of gauge transformation 
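The internal energy is the configurational average of the statistical-mechanical
average of the Hamiltonian:

$$
[E] = \left[ \frac{\mathrm{Tr}_S\, H e^{-\beta H}}{\mathrm{Tr}_S\, e^{-\beta H}} \right]
= \sum_{\tau} \frac{\exp(K_p \sum_{\langle ij \rangle} \tau_{ij})}{(2\cosh K_p)^{N_B}}
\cdot \frac{\mathrm{Tr}_S\, (-J \sum_{\langle ij \rangle} \tau_{ij} S_i S_j) \exp(K \sum_{\langle ij \rangle} \tau_{ij} S_i S_j)}
{\mathrm{Tr}_S \exp(K \sum_{\langle ij \rangle} \tau_{ij} S_i S_j)}. \tag{4.11}
$$

Here $\mathrm{Tr}_S$ denotes the sum over $S = \{S_i = \pm 1\}$, $K = \beta J$, and
$N_B$ is the number of terms in the sum $\sum_{\langle ij \rangle}$ or the total
number of interaction bonds, $N_B = |B|$. We write Tr for the sum over spin
variables at sites and reserve the symbol $\sum_{\tau}$ for variables
$\{\tau_{ij}\}$ defined on bonds.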

Let us now perform gauge transformation. The gauge transformation (4.2)
just changes the order of sums in $\mathrm{Tr}_S$ appearing in (4.11) and those in
$\sum_{\tau}$. For example, the sum over $S_i = \pm 1$ in the order '$+1$ first and
then $-1$' is changed by the gauge transformation to the other order '$-1$ first
and then $+1$', if $\sigma_i = -1$. Thus the value of the internal energy is
independent of the gauge transformation. We then have

$$
[E] = \sum_{\tau} \frac{\exp(K_p \sum \tau_{ij} \sigma_i \sigma_j)}{(2\cosh K_p)^{N_B}}
\cdot \frac{\mathrm{Tr}_S\, (-J \sum \tau_{ij} S_i S_j) \exp(K \sum \tau_{ij} S_i S_j)}
{\mathrm{Tr}_S \exp(K \sum \tau_{ij} S_i S_j)}, \tag{4.12}
$$

where gauge invariance of the Hamiltonian has been used. It should further be
noted that the value of the above formula does not depend on the choice of the
gauge variables $\sigma = \{\sigma_i\}$ (of the Ising type). This implies that the
result remains unchanged if we sum the above equation over all possible values
of $\sigma$ and divide the result by $2^N$, the number of possible configurations
of gauge variables:

$$
[E] = \frac{1}{2^N (2\cosh K_p)^{N_B}} \sum_{\tau} \mathrm{Tr}_\sigma \exp(K_p \sum \tau_{ij} \sigma_i \sigma_j)
\cdot \frac{\mathrm{Tr}_S\, (-J \sum \tau_{ij} S_i S_j) \exp(K \sum \tau_{ij} S_i S_j)}
{\mathrm{Tr}_S \exp(K \sum \tau_{ij} S_i S_j)}. \tag{4.13}
$$



4.3.2 Exact internal energy 


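One observes in (4.13) that, if $K = K_p$, the sum over $S$ in the denominator
(the partition function) cancels out the sum over $\sigma$ that has been obtained
by gauge transformation from the probability distribution $P(J_{ij})$. Then the
internal energy becomes

$$
[E] = \frac{1}{2^N (2\cosh K)^{N_B}} \sum_{\tau} \mathrm{Tr}_S
\Bigl( -J \sum_{\langle ij \rangle} \tau_{ij} S_i S_j \Bigr)
\exp\Bigl( K \sum_{\langle ij \rangle} \tau_{ij} S_i S_j \Bigr). \tag{4.14}
$$

The sums over $\tau$ and $S$ in (4.14) can be carried out as follows:

$$
\begin{aligned}
[E] &= -\frac{J}{2^N (2\cosh K)^{N_B}} \frac{\partial}{\partial K}
\sum_{\tau} \mathrm{Tr}_S \exp\Bigl( K \sum_{\langle ij \rangle} \tau_{ij} S_i S_j \Bigr) \\
&= -\frac{J}{2^N (2\cosh K)^{N_B}} \frac{\partial}{\partial K}
\mathrm{Tr}_S \prod_{\langle ij \rangle} \sum_{\tau_{ij} = \pm 1} \exp(K \tau_{ij} S_i S_j) \\
&= -N_B J \tanh K.
\end{aligned} \tag{4.15}
$$

This is the exact solution for the internal energy under the condition $K = K_p$.
The above calculations hold for any lattice. Special features of each lattice are
reflected only through $N_B$, the total number of bonds.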
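
For a very small system the result (4.15) can be compared with a direct
evaluation of the definition (4.11). The Python sketch below enumerates all bond
and spin configurations; the function name `average_energy`, the two small graphs
(a four-site square with one diagonal, and a triangle), and the value $p = 0.8$
are assumptions chosen for illustration only.

```python
from itertools import product
from math import cosh, exp, log, tanh

def average_energy(bonds, N, K, Kp, J=1.0):
    """Configurational average [E] of (4.11) for the +-J model, by enumeration."""
    NB = len(bonds)
    total = 0.0
    for tau in product([-1, 1], repeat=NB):
        # Probability of this bond sample, written as in (4.4).
        weight = exp(Kp * sum(tau)) / (2 * cosh(Kp)) ** NB
        Z = 0.0   # partition function Tr_S exp(K sum tau S S)
        EZ = 0.0  # Tr_S H exp(-beta H)
        for S in product([-1, 1], repeat=N):
            x = sum(t * S[i] * S[j] for t, (i, j) in zip(tau, bonds))
            w = exp(K * x)
            Z += w
            EZ += -J * x * w
        total += weight * EZ / Z
    return total

p = 0.8
Kp = 0.5 * log(p / (1 - p))  # from exp(2 K_p) = p / (1 - p), eq. (4.5)
square_with_diagonal = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
triangle = [(0, 1), (1, 2), (0, 2)]

# On the line K = K_p the enumeration should reproduce -N_B J tanh K.
print(average_energy(square_with_diagonal, 4, Kp, Kp), -5 * tanh(Kp))
# Away from K = K_p the two numbers differ on a graph with loops.
print(average_energy(triangle, 3, 2 * Kp, Kp), -3 * tanh(2 * Kp))
```

On a graph without loops the bond variables decouple for any $K$, so a graph
containing closed loops is needed to see that the condition $K = K_p$ matters.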

Fig. 4.2. Nishimori line (dashed) in the (a) $\pm J$ and (b) Gaussian models


4.3.3 Relation with the phase diagram 

The condition $K = K_p$ relates the temperature $T$ $(= J/K)$ and the probability
$p$ $(= (\tanh K_p + 1)/2)$, which defines a curve in the $T$-$p$ phase diagram.
The curve $K = K_p$ is called the Nishimori line and connects $(T = 0, p = 1)$ and
$(T = \infty, p = 1/2)$ in the phase diagram of the $\pm J$ model (Fig. 4.2(a)).
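
As a small numerical aid, the relation between $T$ and $p$ on the Nishimori line
can be written as a pair of mutually inverse functions. The names
`nishimori_temperature` and `nishimori_probability` are hypothetical, introduced
here only for illustration, and the sketch assumes $1/2 < p < 1$ and $T > 0$.

```python
from math import log, tanh

def nishimori_temperature(p, J=1.0):
    """T = J / K_p with exp(2 K_p) = p / (1 - p); assumes 1/2 < p < 1."""
    Kp = 0.5 * log(p / (1.0 - p))
    return J / Kp

def nishimori_probability(T, J=1.0):
    """Inverse map p = (tanh(J / T) + 1) / 2; assumes T > 0."""
    return 0.5 * (tanh(J / T) + 1.0)

for p in (0.55, 0.8, 0.95, 0.999):
    T = nishimori_temperature(p)
    print(p, T, nishimori_probability(T))  # the last column reproduces p
```

The temperature decreases monotonically as $p$ grows, in line with the end
points $(T = \infty, p = 1/2)$ and $(T = 0, p = 1)$ quoted above.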

The exact internal energy (4.15) on the Nishimori line has no singularity as
a function of the temperature. The Nishimori line, on the other hand, extends
from the ferromagnetic ground state at $(T = 0, p = 1)$ to the high-temperature
limit $(T = \infty, p = 1/2)$ as shown in Fig. 4.2(a); it inevitably goes across a
phase boundary. It might seem strange that the internal energy is non-singular
when the line crosses a phase boundary at a transition point. We should, however,
accept those two apparently contradicting results since (4.15) is, after all, the
exact solution.³ One possibility is that the singular part of the internal energy
happens to vanish on the Nishimori line. This is probably a feature only of the
internal energy, and the other physical quantities (e.g. the free energy, specific
heat, and magnetic susceptibility) should have singularities at the crossing point.
In almost all cases investigated so far, this transition point on the Nishimori line
is a multicritical point where paramagnetic, ferromagnetic, and spin glass phases
merge.

Similar arguments apply to the Gaussian model. The Nishimori line in this
case is $J_0/J^2 = \beta$, from the cancellation condition of numerator and
denominator as in the previous subsection. It is shown as the dashed line in
Fig. 4.2(b). The energy for $J_0/J^2 = \beta$ is

$$
[E] = -N_B J_0. \tag{4.16}
$$


It is possible to confirm (4.16) for the infinite-range version of the Gaussian
model, the SK model, with $h = 0$. One can easily verify that $m = q$ from the RS
solution (2.28) and (2.30) under the condition $\beta J^2 = J_0$. The internal
energy is, from the free energy (2.27),

$$
[E] = -\frac{N}{2} \{ J_0 m^2 + \beta J^2 (1 - q^2) \}. \tag{4.17}
$$

Insertion of $m = q$ and $\beta J^2 = J_0$ into the above formula gives
$[E] = -J_0 N/2$, which agrees in the limit $N \to \infty$ with (4.16) with
$N_B = N(N-1)/2$ and $J_0 \to J_0/N$. Therefore the RS solution of the SK model is
exact on the Nishimori line at least as far as the internal energy is concerned.
The AT line lies below the Nishimori line and the RS solution is stable. It will
indeed be proved in §4.6.3 that the structure of the phase space is always simple
on the Nishimori line.

³ The existence of a finite region of ferromagnetic phase in two and higher
dimensions has been proved (Horiguchi and Morita 1982a, b).



4.3.4 Distribution of the local energy 
We can calculate the expectation value of the distribution function of the energy
of a single bond $J_{ij} S_i S_j$

$$
P(E) = [\langle \delta(E - J_{ij} S_i S_j) \rangle] \tag{4.18}
$$

by the same method as above (Nishimori 1986a). Since $\delta(E - J_{ij} S_i S_j)$ is
gauge invariant, arguments in §4.3 apply and the following relation corresponding
to (4.14) is derived when $K = K_p$:

$$
P(E) = \frac{1}{2^N (2\cosh K)^{N_B}} \sum_{\tau} \mathrm{Tr}_S\,
\delta(E - J_{ij} S_i S_j) \exp\Bigl( K \sum_{\langle lm \rangle} \tau_{lm} S_l S_m \Bigr). \tag{4.19}
$$

Summing over bond variables other than the specific one $(ij)$, which we are
treating, can be carried out. The result cancels out with the corresponding factors
in the denominator. The problem then reduces to the sum over the three variables
$\tau_{ij}$, $S_i$, and $S_j$, which is easily performed to yield

$$
P(E) = p\,\delta(E - J) + (1 - p)\,\delta(E + J). \tag{4.20}
$$

It is also possible to show that the simultaneous distribution of two different
bonds is decoupled to the product of distributions of single bonds when $K = K_p$:

$$
P_2(E_1, E_2) = [\langle \delta(E_1 - J_{ij} S_i S_j)\, \delta(E_2 - J_{kl} S_k S_l) \rangle] = P(E_1) P(E_2). \tag{4.21}
$$

The same holds for distributions of more than two bonds. According to (4.20) and
(4.21), when $K = K_p$, the local energy of a bond is determined independently
of the other bonds or spin variables on average but depends only on the original
distribution function (2.3). The same is true for the Gaussian model.
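
The decoupling expressed by (4.20) and (4.21) can also be inspected by brute
force on a small graph. In the Python sketch below the function name
`joint_bond_distribution`, the five-bond graph, and the value $p = 0.8$ are
illustrative assumptions; the bond variable is recorded as
$e = \tau_{ij} S_i S_j = \pm 1$, so that the bond energy in the sense of (4.18)
is $E = Je$.

```python
from itertools import product
from math import cosh, exp, log

def joint_bond_distribution(bonds, N, K, Kp, a=0, b=1):
    """Return {(e_a, e_b): probability}, e = tau_ij S_i S_j, for bonds a and b."""
    NB = len(bonds)
    dist = {(ea, eb): 0.0 for ea in (-1, 1) for eb in (-1, 1)}
    for tau in product([-1, 1], repeat=NB):
        weight = exp(Kp * sum(tau)) / (2 * cosh(Kp)) ** NB  # bond probability (4.4)
        Z = 0.0
        local = dict.fromkeys(dist, 0.0)  # unnormalized thermal weights per outcome
        for S in product([-1, 1], repeat=N):
            e = [t * S[i] * S[j] for t, (i, j) in zip(tau, bonds)]
            w = exp(K * sum(e))
            Z += w
            local[(e[a], e[b])] += w
        for key in dist:
            dist[key] += weight * local[key] / Z
    return dist

p = 0.8
Kp = 0.5 * log(p / (1 - p))
bonds = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]  # square with one diagonal
dist = joint_bond_distribution(bonds, 4, Kp, Kp)
for key in sorted(dist):
    print(key, dist[key])  # compare with p*p, p*(1-p), (1-p)*p, (1-p)*(1-p)
```

Summing the joint table over the second bond gives the single-bond weights $p$
and $1 - p$ of (4.20).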


4.3.5 Distribution of the local field 
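The distribution function of the local field to site $i$

$$
P(h) = \Bigl[ \Bigl\langle \delta\Bigl( h - \sum_j J_{ij} S_j \Bigr) \Bigr\rangle \Bigr] \tag{4.22}
$$

can be evaluated exactly if $K = K_p$ by the same method (Nishimori 1986a).
Since the Hamiltonian (4.1) is invariant under the overall spin flip
$S_i \to -S_i$ $(\forall i)$, (4.22) is equal to

$$
P(h) = \frac{1}{2} \Bigl[ \Bigl\langle \delta\Bigl( h - \sum_j J_{ij} S_j \Bigr) \Bigr\rangle
+ \Bigl\langle \delta\Bigl( h + \sum_j J_{ij} S_j \Bigr) \Bigr\rangle \Bigr], \tag{4.23}
$$

which is manifestly gauge invariant. We can therefore evaluate it as before under
the condition $K = K_p$. After gauge transformation and cancellation of
denominator and numerator, one is left with the variables related to site $i$ to
find

$$
\begin{aligned}
P(h) &= \frac{1}{2 (2\cosh K)^z} \sum_{\tau}
\Bigl\{ \delta\Bigl( h - J \sum_j \tau_{ij} \Bigr) + \delta\Bigl( h + J \sum_j \tau_{ij} \Bigr) \Bigr\}
\exp\Bigl( K \sum_j \tau_{ij} \Bigr) \\
&= \frac{1}{(2\cosh K)^z} \sum_{\tau} \delta\Bigl( h - J \sum_j \tau_{ij} \Bigr) \cosh \beta h,
\end{aligned} \tag{4.24}
$$

where $z$ is the coordination number, and the sum over $\tau$ runs over the bonds
connected to $i$. This result shows again that each bond connected to $i$ behaves
independently of other bonds and spins with the appropriate probability weight
$p = e^{K_p}/2\cosh K_p$ or $1 - p = e^{-K_p}/2\cosh K_p$. The same argument for
the Gaussian model leads to the distribution

$$
P(h) = \frac{1}{2\sqrt{2\pi z}\, J} \left\{
\exp\left( -\frac{(h - z\beta J^2)^2}{2 z J^2} \right)
+ \exp\left( -\frac{(h + z\beta J^2)^2}{2 z J^2} \right) \right\} \tag{4.25}
$$

when $\beta J^2 = J_0$.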


4.4 Bound on the specific heat 


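It is not possible to derive the exact solution of the specific heat. We can
nevertheless estimate its upper bound. The specific heat is the temperature
derivative of the internal energy:

$$
T^2 [C] = -\frac{\partial [E]}{\partial \beta}
= \left[ \frac{\mathrm{Tr}_S\, H^2 e^{-\beta H}}{\mathrm{Tr}_S\, e^{-\beta H}}
- \left( \frac{\mathrm{Tr}_S\, H e^{-\beta H}}{\mathrm{Tr}_S\, e^{-\beta H}} \right)^2 \right]. \tag{4.26}
$$

For the $\pm J$ model, the first term of the above expression $(\equiv C_1)$ can be
calculated in the same manner as before:

$$
\begin{aligned}
C_1 &= \sum_{\tau} \frac{\exp(K_p \sum \tau_{ij})}{(2\cosh K_p)^{N_B}}
\cdot \frac{\mathrm{Tr}_S\, (-J \sum \tau_{ij} S_i S_j)^2 \exp(K \sum \tau_{ij} S_i S_j)}
{\mathrm{Tr}_S \exp(K \sum \tau_{ij} S_i S_j)} \\
&= \frac{J^2}{2^N (2\cosh K_p)^{N_B}} \sum_{\tau} \mathrm{Tr}_\sigma \exp(K_p \sum \tau_{ij} \sigma_i \sigma_j)
\cdot \frac{(\partial^2/\partial K^2)\, \mathrm{Tr}_S \exp(K \sum \tau_{ij} S_i S_j)}
{\mathrm{Tr}_S \exp(K \sum \tau_{ij} S_i S_j)}.
\end{aligned} \tag{4.27}
$$

Cancellation of the denominator and numerator is observed when $K = K_p$ to
give

$$
\begin{aligned}
C_1 &= \frac{J^2}{2^N (2\cosh K)^{N_B}} \frac{\partial^2}{\partial K^2}
\sum_{\tau} \mathrm{Tr}_S \exp\Bigl( K \sum \tau_{ij} S_i S_j \Bigr) \\
&= \frac{J^2}{(2\cosh K)^{N_B}} \frac{\partial^2}{\partial K^2} (2\cosh K)^{N_B} \\
&= J^2 (N_B^2 \tanh^2 K + N_B\, \mathrm{sech}^2 K).
\end{aligned} \tag{4.28}
$$

The second term on the right hand side of (4.26), $C_2$, cannot be evaluated
directly but its lower bound is obtained by the Schwarz inequality:

$$
C_2 = [E^2] \ge [E]^2 = J^2 N_B^2 \tanh^2 K. \tag{4.29}
$$

From (4.28) and (4.29),

$$
T^2 [C] \le J^2 N_B\, \mathrm{sech}^2 K. \tag{4.30}
$$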


Hence the specific heat on the Nishimori line does not diverge although the line 
crosses a phase boundary and thus the specific heat would be singular at the 
transition point. 
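
The bound (4.30) can be checked against a direct evaluation of (4.26) on a small
graph. The following Python sketch is illustrative only; the function name
`specific_heat_T2`, the five-bond graph, and $p = 0.8$ are assumptions made for
this example, and units with $k_B = 1$ are used.

```python
from itertools import product
from math import cosh, exp, log

def specific_heat_T2(bonds, N, K, Kp, J=1.0):
    """T^2 [C] = [<H^2> - <H>^2] of (4.26) for the +-J model, by enumeration."""
    NB = len(bonds)
    total = 0.0
    for tau in product([-1, 1], repeat=NB):
        weight = exp(Kp * sum(tau)) / (2 * cosh(Kp)) ** NB  # bond probability (4.4)
        Z = H1 = H2 = 0.0
        for S in product([-1, 1], repeat=N):
            x = sum(t * S[i] * S[j] for t, (i, j) in zip(tau, bonds))
            w = exp(K * x)
            Z += w
            H1 += -J * x * w        # accumulates Tr_S H exp(-beta H)
            H2 += (J * x) ** 2 * w  # accumulates Tr_S H^2 exp(-beta H)
        total += weight * (H2 / Z - (H1 / Z) ** 2)
    return total

p = 0.8
Kp = 0.5 * log(p / (1 - p))
bonds = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]  # square with one diagonal
value = specific_heat_T2(bonds, 4, Kp, Kp)
bound = len(bonds) / cosh(Kp) ** 2                # J^2 N_B sech^2 K with J = 1
print(value, bound)                               # value should not exceed bound
```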

For the Gaussian distribution, the upper bound on the specific heat is

$$
T^2 [C] \le J^2 N_B. \tag{4.31}
$$


4.5 Bound on the free energy and internal energy 


We can derive an interesting inequality involving the free energy and,
simultaneously, rederive the internal energy and the bound on the specific heat
from an inequality on the Kullback-Leibler divergence (Iba 1999). Let us suppose
that $P(x)$ and $Q(x)$ are probability distribution functions of a stochastic
variable $x$. These functions satisfy the normalization condition
$\sum_x P(x) = \sum_x Q(x) = 1$. The following quantity is called the
Kullback-Leibler divergence of $P(x)$ and $Q(x)$:

$$
G = \sum_x P(x) \log \frac{P(x)}{Q(x)}. \tag{4.32}
$$

Since $G$ vanishes when $P(x) = Q(x)$ $(\forall x)$, it measures the similarity of
the two distributions. The Kullback-Leibler divergence is also called the
relative entropy.

The Kullback-Leibler divergence is positive semi-definite:

$$
G = \sum_x P(x) \left\{ \log \frac{P(x)}{Q(x)} + \frac{Q(x)}{P(x)} - 1 \right\} \ge 0. \tag{4.33}
$$

Here we have used the inequality $-\log y + y - 1 \ge 0$ for positive $y$. The
inequality (4.33) leads to an inequality on the free energy. We restrict
ourselves to the $\pm J$ model for simplicity.

Let us choose the set of signs $\tau = \{\tau_{ij}\}$ of $J_{ij} = J\tau_{ij}$ as the
stochastic variable $x$ and define $P(x)$ and $Q(x)$ as

$$
P(\tau) = \frac{\mathrm{Tr}_\sigma \exp(K_p \sum_{\langle ij \rangle} \tau_{ij} \sigma_i \sigma_j)}{2^N (2\cosh K_p)^{N_B}}, \qquad
Q(\tau) = \frac{\mathrm{Tr}_\sigma \exp(K \sum_{\langle ij \rangle} \tau_{ij} \sigma_i \sigma_j)}{2^N (2\cosh K)^{N_B}}. \tag{4.34}
$$

It is easy to check that these functions satisfy the normalization condition. Then
(4.32) has the following expression:

$$
\begin{aligned}
G = {} & \sum_{\tau} \frac{\mathrm{Tr}_\sigma \exp(K_p \sum \tau_{ij} \sigma_i \sigma_j)}{2^N (2\cosh K_p)^{N_B}}
\cdot \Bigl\{ \log \mathrm{Tr}_\sigma \exp(K_p \sum \tau_{ij} \sigma_i \sigma_j)
- \log \mathrm{Tr}_\sigma \exp(K \sum \tau_{ij} \sigma_i \sigma_j) \Bigr\} \\
& - N_B \log 2\cosh K_p + N_B \log 2\cosh K.
\end{aligned} \tag{4.35}
$$

This equation can be shown to be equivalent to the following relation by using
gauge transformation:

$$
\begin{aligned}
G = {} & \sum_{\tau} \frac{\exp(K_p \sum \tau_{ij})}{(2\cosh K_p)^{N_B}}
\cdot \Bigl\{ \log \mathrm{Tr}_\sigma \exp(K_p \sum \tau_{ij} \sigma_i \sigma_j)
- \log \mathrm{Tr}_\sigma \exp(K \sum \tau_{ij} \sigma_i \sigma_j) \Bigr\} \\
& - N_B \log 2\cosh K_p + N_B \log 2\cosh K.
\end{aligned} \tag{4.36}
$$


The second term on the right hand side of (4.36) is nothing more than the
logarithm of the partition function of the $\pm J$ model, the configurational
average of which is, up to sign, the free energy $F(K, p)$ divided by
temperature. The first term on the right hand side is the same quantity on the
Nishimori line $(K = K_p)$. Hence the inequality $G \ge 0$ is rewritten as

$$
\beta F(K, p) + N_B \log 2\cosh K \ge \beta_p F(K_p, p) + N_B \log 2\cosh K_p. \tag{4.37}
$$

The function $\beta F_0(K) = -N_B \log 2\cosh K$ is equal to the free energy of the
one-dimensional $\pm J$ model with $N_B$ bonds and free boundary condition. Then
(4.37) is written as

$$
\beta \{ F(K, p) - F_0(K) \} \ge \beta_p \{ F(K_p, p) - F_0(K_p) \}. \tag{4.38}
$$

This inequality suggests that the system becomes closest to the one-dimensional
model on the Nishimori line as far as the free energy is concerned.
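
The inequality (4.38) states that the left hand side, as a function of $K$ at
fixed $p$, is smallest at $K = K_p$. The Python sketch below evaluates this
quantity by enumeration on a small graph; the function name `g_function`, the
graph, the sample values of $K$, and $p = 0.8$ are illustrative assumptions, and
the free energy is taken as $\beta F(K, p) = -[\log Z(K)]$.

```python
from itertools import product
from math import cosh, exp, log

def g_function(bonds, N, K, Kp):
    """beta {F(K, p) - F_0(K)} = -[log Z(K)] + N_B log(2 cosh K), by enumeration."""
    NB = len(bonds)
    avg_log_Z = 0.0
    for tau in product([-1, 1], repeat=NB):
        weight = exp(Kp * sum(tau)) / (2 * cosh(Kp)) ** NB  # bond probability (4.4)
        Z = sum(
            exp(K * sum(t * S[i] * S[j] for t, (i, j) in zip(tau, bonds)))
            for S in product([-1, 1], repeat=N)
        )
        avg_log_Z += weight * log(Z)
    return -avg_log_Z + NB * log(2 * cosh(K))

p = 0.8
Kp = 0.5 * log(p / (1 - p))
bonds = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]  # square with one diagonal
g_min = g_function(bonds, 4, Kp, Kp)
for K in (0.1, 0.5 * Kp, 0.9 * Kp, Kp, 1.1 * Kp, 2.0 * Kp, 3.0):
    # The difference is the Kullback-Leibler divergence G of (4.36): never negative.
    print(K, g_function(bonds, 4, K, Kp) - g_min)
```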

Let us write the left hand side of (4.38) as $g(K, p)$. Minimization of $g(K, p)$
at $K = K_p$ gives the following relations:

$$
\left. \frac{\partial g(K, p)}{\partial K} \right|_{K = K_p} = 0, \qquad
\left. \frac{\partial^2 g(K, p)}{\partial K^2} \right|_{K = K_p} \ge 0. \tag{4.39}
$$

The equality in the second relation holds when $g(K, p)$ is flat at $K = K_p$. This
happens, for instance, for the one-dimensional model where $g(K, p) = 0$
identically. However, such a case is exceptional and the strict inequality holds
in most systems. By noting that the derivative of $\beta F$ by $\beta$ is the
internal energy, we can confirm that the first equation of (4.39) agrees with the
exact solution for the internal energy (4.15). The second formula of (4.39) is
seen to be equivalent to the upper bound of the specific heat (4.30).

When the strict inequality holds in the second relation of (4.39), the following
inequality follows for $K$ close to $K_p$:

$$
J \frac{\partial g(K, p)}{\partial K} = [E] + N_B J \tanh K
\begin{cases} < 0 & (K < K_p) \\ > 0 & (K > K_p). \end{cases} \tag{4.40}
$$

The energy thus satisfies

$$
[E(T, p)] \begin{cases} < -N_B J \tanh K & (T > T_p = J/K_p) \\ > -N_B J \tanh K & (T < T_p) \end{cases} \tag{4.41}
$$

when $T$ is not far away from $T_p$. The internal energy for generic $T$ and $p$ is
smaller (larger) than the one-dimensional value when $T$ is slightly larger
(smaller) than $T_p$ corresponding to the point on the Nishimori line for the
given $p$.


4.6 Correlation functions 


One can apply the gauge theory to correlation functions to derive an upper
bound that strongly restricts the possible structure of the phase diagram
(Nishimori 1981; Horiguchi and Morita 1981). For simplicity, the formulae below
are written only in terms of two-point correlation functions (the expectation
value of the product of two spin variables) although the same arguments apply to
any other many-point correlation functions. It will also be shown that the
distribution function of the spin glass order parameter has a simple structure on
the Nishimori line (Nishimori and Sherrington 2001; Gillin et al. 2001) and that
the spin configuration is a non-monotonic function of the temperature (Nishimori
1993).


4.6.1 Identities 
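Let us consider the $\pm J$ model. The two-point correlation function is defined by

$$
[\langle S_0 S_r \rangle_K]
= \left[ \frac{\mathrm{Tr}_S\, S_0 S_r e^{-\beta H}}{\mathrm{Tr}_S\, e^{-\beta H}} \right]
= \sum_{\tau} \frac{\exp(K_p \sum \tau_{ij})}{(2\cosh K_p)^{N_B}}
\cdot \frac{\mathrm{Tr}_S\, S_0 S_r \exp(K \sum \tau_{ij} S_i S_j)}
{\mathrm{Tr}_S \exp(K \sum \tau_{ij} S_i S_j)}. \tag{4.42}
$$

A gauge transformation changes the above expression to

$$
[\langle S_0 S_r \rangle_K]
= \frac{1}{2^N (2\cosh K_p)^{N_B}} \sum_{\tau}
\mathrm{Tr}_\sigma\, \sigma_0 \sigma_r \exp(K_p \sum \tau_{ij} \sigma_i \sigma_j)
\cdot \frac{\mathrm{Tr}_S\, S_0 S_r \exp(K \sum \tau_{ij} S_i S_j)}
{\mathrm{Tr}_S \exp(K \sum \tau_{ij} S_i S_j)}. \tag{4.43}
$$
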


We do not observe a cancellation of the numerator and denominator here even
when $K = K_p$ because of the factor $\sigma_0 \sigma_r$ caused by the gauge
transformation of $S_0 S_r$. However, an interesting identity results if one
inserts the partition function to the numerator and denominator:

$$
[\langle S_0 S_r \rangle_K]
= \frac{1}{2^N (2\cosh K_p)^{N_B}} \sum_{\tau}
\mathrm{Tr}_\sigma \exp(K_p \sum \tau_{ij} \sigma_i \sigma_j)
\cdot \frac{\mathrm{Tr}_\sigma\, \sigma_0 \sigma_r \exp(K_p \sum \tau_{ij} \sigma_i \sigma_j)}
{\mathrm{Tr}_\sigma \exp(K_p \sum \tau_{ij} \sigma_i \sigma_j)}
\cdot \frac{\mathrm{Tr}_S\, S_0 S_r \exp(K \sum \tau_{ij} S_i S_j)}
{\mathrm{Tr}_S \exp(K \sum \tau_{ij} S_i S_j)}. \tag{4.44}
$$

The last two factors here represent correlation functions
$\langle \sigma_0 \sigma_r \rangle_{K_p}$ and $\langle S_0 S_r \rangle_K$ for
interaction strengths $K_p$ and $K$, respectively. The above expression turns out
to be equal to the configurational average of the product of these two
correlation functions

$$
[\langle S_0 S_r \rangle_K] = [\langle \sigma_0 \sigma_r \rangle_{K_p} \langle S_0 S_r \rangle_K]. \tag{4.45}
$$

To see this, we write the definition of the right hand side as

$$
[\langle \sigma_0 \sigma_r \rangle_{K_p} \langle S_0 S_r \rangle_K]
= \frac{1}{(2\cosh K_p)^{N_B}} \sum_{\tau} \exp(K_p \sum \tau_{ij})
\cdot \frac{\mathrm{Tr}_S\, S_0 S_r \exp(K_p \sum \tau_{ij} S_i S_j)}
{\mathrm{Tr}_S \exp(K_p \sum \tau_{ij} S_i S_j)}
\cdot \frac{\mathrm{Tr}_S\, S_0 S_r \exp(K \sum \tau_{ij} S_i S_j)}
{\mathrm{Tr}_S \exp(K \sum \tau_{ij} S_i S_j)}. \tag{4.46}
$$

Here we have used the variable $S$ instead of $\sigma$ in writing
$\langle \sigma_0 \sigma_r \rangle_{K_p}$. The result is independent of such a
choice because, after all, we sum it over $\pm 1$. The product of the two
correlation functions in (4.46) is clearly gauge invariant. Thus, after gauge
transformation, (4.46) is seen to be equal to (4.44), and (4.45) has been proved.

By taking the limit $r \to \infty$ in (4.45) with $K = K_p$, site 0 should become
independent of site $r$ so that the left hand side would approach
$[\langle S_0 \rangle_K][\langle S_r \rangle_K]$, the square of the ferromagnetic
order parameter $m$. The right hand side, on the other hand, approaches
$[\langle \sigma_0 \rangle_K \langle S_0 \rangle_K][\langle \sigma_r \rangle_K \langle S_r \rangle_K]$,
the square of the spin glass order parameter $q$. We therefore have $m = q$ on
the Nishimori line. Since the spin glass phase has $m = 0$ and $q > 0$ by
definition, we conclude that the Nishimori line never enters the spin glass phase
(if any). In other words, we have obtained a restriction on the possible location
of the spin glass phase. The result $m = q$ will be confirmed from a different
argument in §4.6.3.
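
The identity (4.45) holds for any $K$, which makes it convenient to test by
enumeration on a small graph. In the Python sketch below the function names
`thermal_correlation` and `both_sides`, the five-bond graph, the chosen pair of
sites, and $p = 0.8$ are assumptions made for illustration only.

```python
from itertools import product
from math import cosh, exp, log

def thermal_correlation(bonds, N, tau, K, a, b):
    """<S_a S_b>_K for one fixed bond configuration tau."""
    Z = C = 0.0
    for S in product([-1, 1], repeat=N):
        w = exp(K * sum(t * S[i] * S[j] for t, (i, j) in zip(tau, bonds)))
        Z += w
        C += S[a] * S[b] * w
    return C / Z

def both_sides(bonds, N, K, Kp, a=0, b=2):
    """Return ([<S_a S_b>_K], [<sigma_a sigma_b>_Kp <S_a S_b>_K]) as in (4.45)."""
    NB = len(bonds)
    lhs = rhs = 0.0
    for tau in product([-1, 1], repeat=NB):
        weight = exp(Kp * sum(tau)) / (2 * cosh(Kp)) ** NB  # bond probability (4.4)
        c_K = thermal_correlation(bonds, N, tau, K, a, b)
        c_Kp = thermal_correlation(bonds, N, tau, Kp, a, b)
        lhs += weight * c_K
        rhs += weight * c_Kp * c_K
    return lhs, rhs

p = 0.8
Kp = 0.5 * log(p / (1 - p))
bonds = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]  # square with one diagonal
for K in (0.2, Kp, 1.7):
    print(K, both_sides(bonds, 4, K, Kp))  # the two entries of each pair agree
```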

Another interesting relation on the correlation function can be derived as
follows. Let us consider the configurational average of the inverse of the
correlation function $[\langle S_0 S_r \rangle_K^{-1}]$. The same manipulation as
above leads to the following relation:

$$
\left[ \frac{1}{\langle S_0 S_r \rangle_K} \right]
= \left[ \frac{\langle \sigma_0 \sigma_r \rangle_{K_p}}{\langle S_0 S_r \rangle_K} \right]. \tag{4.47}
$$

The right hand side is unity if $K = K_p$. Therefore the expectation value of the
inverse of an arbitrary correlation function is one on the Nishimori line. This
is not as unnatural as it might seem; the inverse correlation
$\langle S_0 S_r \rangle_K^{-1}$ is either greater than 1 or less than $-1$
depending upon its sign. The former contribution dominates on the Nishimori line
(a part of which lies within the ferromagnetic phase), resulting in the positive
constant.

A more general relation holds if we multiply the numerator of the above
equation (4.47) by an arbitrary gauge-invariant quantity $Q$:

$$
\left[ \frac{Q}{\langle S_0 S_r \rangle_K} \right]
= \left[ \frac{Q \langle \sigma_0 \sigma_r \rangle_{K_p}}{\langle S_0 S_r \rangle_K} \right]. \tag{4.48}
$$

Here $Q$ may be, for instance, the energy at an arbitrary temperature
$\langle H \rangle_{K'}$ or the absolute value of an arbitrary correlation
$|\langle S_i S_j \rangle_{K'}|$. This identity (4.48) shows that the
gauge-invariant quantity $Q$ is completely uncorrelated with the inverse of the
correlation function on the Nishimori line because the configurational average
decouples:

$$
\left[ \frac{Q}{\langle S_0 S_r \rangle_{K_p}} \right] = [Q]
= [Q] \left[ \frac{1}{\langle S_0 S_r \rangle_{K_p}} \right]. \tag{4.49}
$$

It is somewhat counter-intuitive that any gauge-invariant quantity $Q$ takes a
value independent of an arbitrarily chosen inverse correlation function. More
work should be done to clarify the significance of this result.

4.6.2 Restrictions on the phase diagram 


A useful inequality is derived from the correlation identity (4.45). By taking the
absolute values of both sides of (4.45), we find

$$
|[\langle S_0 S_r \rangle_K]|
= |[\langle \sigma_0 \sigma_r \rangle_{K_p} \langle S_0 S_r \rangle_K]|
\le [|\langle \sigma_0 \sigma_r \rangle_{K_p}|\, |\langle S_0 S_r \rangle_K|]
\le [|\langle \sigma_0 \sigma_r \rangle_{K_p}|]. \tag{4.50}
$$

It has been used here that an upper bound is obtained by taking the absolute
value before the expectation value and that the correlation function does not
exceed unity.

The right hand side of (4.50) represents a correlation function on the
Nishimori line $K = K_p$. This is a correlation between site 0 and site $r$,
ignoring the sign of the usual correlation function
$\langle \sigma_0 \sigma_r \rangle_{K_p}$. It does not decay with increasing $r$ if
spins are frozen at each site as in the spin glass and ferromagnetic phases.
The left hand side, on the other hand, reduces to the square of the usual
ferromagnetic order parameter in the limit $r \to \infty$ and therefore approaches
zero in the spin glass and paramagnetic phases. The right hand side of (4.50)
vanishes as $r \to \infty$ if the point on the Nishimori line corresponding to a
given $p$ lies within the paramagnetic phase. Then the left hand side vanishes
irrespective of $K$, implying the absence of ferromagnetic ordering. This fact can
be interpreted as follows.

Let us define a point A as the crossing of the vertical (constant $p$) line L
and the Nishimori line. If A lies in the paramagnetic phase as in Fig. 4.3, no
point on L is in the ferromagnetic phase by the above argument. We therefore
conclude that the phase boundary between the ferromagnetic and spin glass (or
paramagnetic, in the absence of the spin glass phase) phases does not extend
below the spin glass (or paramagnetic) phase like $C_1$. The boundary is either
vertical as $C_2$ or re-entrant as $C_3$ where the spin glass (or paramagnetic)
phase lies below the ferromagnetic phase. It can also be seen that there is no
ferromagnetic phase to the left of the crossing point M of the Nishimori line and
the boundary between the ferromagnetic and non-ferromagnetic phases. It is then
natural to expect that M is a multicritical point (where paramagnetic,
ferromagnetic, and spin glass phases merge) as has been confirmed by the
renormalization-group and numerical calculations cited at the end of this chapter.


4.6.3 Distribution of order parameters 


The remarkable simplicity of the exact energy (4.15) and independence of energy
distribution (4.21) suggest that the state of the system would be a simple one on
the Nishimori line. This observation is reinforced by the relation $q = m$, which
is interpreted as the absence of the spin glass phase. We show in the present
subsection that a general relation between the order parameter distribution
functions confirms this conclusion. In particular, we prove that the distribution
functions of $q$ and $m$ coincide, $P_q(x) = P_m(x)$, a generalization of the
relation $q = m$. Since the magnetization shows no RSB, the structure of $P_m(x)$
is simple (i.e. composed of at most two delta functions). Then the relation
$P_q(x) = P_m(x)$ implies that $P_q(x)$ is also simple, leading to the absence of
a complicated structure of the phase space on the Nishimori line.

The distribution function of the spin glass order parameter for a generic
finite-dimensional system is defined using two replicas of the system with spins
$\sigma$ and $S$:

$$
P_q(x) = \left[ \frac{\mathrm{Tr}_\sigma \mathrm{Tr}_S\,
\delta\bigl( x - \frac{1}{N} \sum_i \sigma_i S_i \bigr)\, e^{-\beta H(\sigma) - \beta H(S)}}
{\mathrm{Tr}_\sigma \mathrm{Tr}_S\, e^{-\beta H(\sigma) - \beta H(S)}} \right], \tag{4.51}
$$

where the Hamiltonians are

$$
H(\sigma) = -\sum_{\langle ij \rangle} J_{ij} \sigma_i \sigma_j, \qquad
H(S) = -\sum_{\langle ij \rangle} J_{ij} S_i S_j. \tag{4.52}
$$

The two replicas share the same set of bonds $J$. There is no interaction between
the two replicas. If the system has a complicated phase space, the spins take
various different states so that the overlap of two independent spin
configurations $\sum_i \sigma_i S_i / N$ is expected to have more than two values
$(\pm m^2)$ for a finite-step RSB or a continuous spectrum for full RSB as in the
SK model.

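It is convenient to define a generalization of $P_q(x)$ that compares spin
configurations at two different (inverse) temperatures $\beta_1$ and $\beta_2$,

$$
P_q(x; \beta_1 J, \beta_2 J) = \left[ \frac{\mathrm{Tr}_\sigma \mathrm{Tr}_S\,
\delta\bigl( x - \frac{1}{N} \sum_i \sigma_i S_i \bigr)\, e^{-\beta_1 H(\sigma) - \beta_2 H(S)}}
{\mathrm{Tr}_\sigma \mathrm{Tr}_S\, e^{-\beta_1 H(\sigma) - \beta_2 H(S)}} \right]. \tag{4.53}
$$

The distribution of magnetization is defined similarly but without replicas:

$$
P_m(x) = \left[ \frac{\mathrm{Tr}_S\, \delta\bigl( x - \frac{1}{N} \sum_i S_i \bigr)\, e^{-\beta H(S)}}
{\mathrm{Tr}_S\, e^{-\beta H(S)}} \right]. \tag{4.54}
$$

By applying the same procedure as in §4.6.1 to the right hand side of (4.54), we
find

$$
P_m(x) = \frac{1}{2^N (2\cosh K_p)^{N_B}} \sum_{\tau}
\mathrm{Tr}_\sigma \exp(K_p \sum \tau_{ij} \sigma_i \sigma_j)
\cdot \frac{\mathrm{Tr}_\sigma \mathrm{Tr}_S\, \delta\bigl( x - \frac{1}{N} \sum_i \sigma_i S_i \bigr)
\exp(K_p \sum \tau_{ij} \sigma_i \sigma_j) \exp(K \sum \tau_{ij} S_i S_j)}
{\mathrm{Tr}_\sigma \mathrm{Tr}_S \exp(K_p \sum \tau_{ij} \sigma_i \sigma_j) \exp(K \sum \tau_{ij} S_i S_j)}. \tag{4.55}
$$

Similarly, gauge transformation of the right hand side of (4.53) yields the same
expression as above when $\beta_1 J = K_p$ and $\beta_2 J = K$. We therefore have

$$
P_m(x; K) = P_q(x; K_p, K). \tag{4.56}
$$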


Here the $K$-dependence of the left hand side has been written out explicitly. By
setting $K = K_p$, we obtain $P_q(x) = P_m(x)$. This relation shows that the phase
space is simple on the Nishimori line as mentioned above. Comparison of the first
moments of $P_q(x)$ and $P_m(x)$ proves that $q = m$. Since $q = m$ means that the
ordered state on the Nishimori line should be a ferromagnetic phase, the absence
of complicated phase space implies the absence of a mixed phase (ferromagnetic
phase with complicated phase space).
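
Relation (4.56) between the two distribution functions can be verified on a small
graph, where both sides are finite histograms. The Python sketch below is
illustrative only: the function names `boltzmann`, `magnetization_distribution`,
and `overlap_distribution`, the graph, and $p = 0.8$ are assumptions, and the
histograms are indexed by the integers $\sum_i S_i$ and $\sum_i \sigma_i S_i$
rather than by $x$.

```python
from itertools import product
from math import cosh, exp, log

def boltzmann(bonds, N, tau, K):
    """Normalized Boltzmann probabilities {S: weight} for bond signs tau at coupling K."""
    w = {S: exp(K * sum(t * S[i] * S[j] for t, (i, j) in zip(tau, bonds)))
         for S in product([-1, 1], repeat=N)}
    Z = sum(w.values())
    return {S: v / Z for S, v in w.items()}

def bond_samples(bonds, Kp):
    """Yield (tau, probability) pairs according to (4.4)."""
    NB = len(bonds)
    for tau in product([-1, 1], repeat=NB):
        yield tau, exp(Kp * sum(tau)) / (2 * cosh(Kp)) ** NB

def magnetization_distribution(bonds, N, K, Kp):
    """P_m(x; K) as {M: probability}, M = sum_i S_i (so x = M / N)."""
    dist = {M: 0.0 for M in range(-N, N + 1, 2)}
    for tau, weight in bond_samples(bonds, Kp):
        for S, pS in boltzmann(bonds, N, tau, K).items():
            dist[sum(S)] += weight * pS
    return dist

def overlap_distribution(bonds, N, K1, K2, Kp):
    """P_q(x; K1, K2) as {Q: probability}, Q = sum_i sigma_i S_i (so x = Q / N)."""
    dist = {Q: 0.0 for Q in range(-N, N + 1, 2)}
    for tau, weight in bond_samples(bonds, Kp):
        replica1 = boltzmann(bonds, N, tau, K1)  # replica sigma at coupling K1
        replica2 = boltzmann(bonds, N, tau, K2)  # replica S at coupling K2
        for sig, p1 in replica1.items():
            for S, p2 in replica2.items():
                dist[sum(a * b for a, b in zip(sig, S))] += weight * p1 * p2
    return dist

p = 0.8
Kp = 0.5 * log(p / (1 - p))
bonds = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]  # square with one diagonal
K = 1.5
print(magnetization_distribution(bonds, 4, K, Kp))
print(overlap_distribution(bonds, 4, Kp, K, Kp))  # same histogram as the line above
```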

Equation (4.56) may be interpreted that the spin state at $K$ projected to the
spin state at $K_p$ (the right hand side) always has a simple distribution
function (the left hand side). Physically, this implies that the spin state at
$K_p$, on the Nishimori line, is much like a perfect ferromagnetic state since the
left hand side of the above equation represents the distribution of the spin
state relative to the perfect ferromagnetic state, $\sum_i (S_i \cdot 1)$. This
observation will be confirmed from a different point of view in §4.6.4.

Fig. 4.4. The distribution function of the spin glass order parameter on the
Nishimori line is simple (a). If there exists a phase with full RSB immediately
below the Nishimori line, the derivative of $P_q(x)$ with respect to $K$ should
be positive in a finite range of $x$ (b). (Panels: (a) $K = K_p$, (b) $K > K_p$;
horizontal axis $x$ with marks at $-m^2$, 0, $m^2$.)

More information can be extracted from (4.56) on the possible location of
the AT line. Differentiation of both sides of (4.56) by $K$ at $K = K_p$ yields

$$
\left. \frac{\partial}{\partial K} P_m(x; K) \right|_{K = K_p}
= \left. \frac{\partial}{\partial K} P_q(x; K_p, K) \right|_{K = K_p}
= \frac{1}{2} \left. \frac{\partial}{\partial K} P_q(x; K, K) \right|_{K = K_p}. \tag{4.57}
$$

The left hand side is

$$
\frac{\partial}{\partial K} P_m(x; K)
= -\frac{1}{2} \delta'(x - m(K))\, m'(K) + \frac{1}{2} \delta'(x + m(K))\, m'(K), \tag{4.58}
$$

which vanishes at almost all $x$ $(\ne \pm m(K))$. The right hand side thus
vanishes at $x \ne \pm m(K)$. It follows that there does not exist a phase with
full RSB immediately below the Nishimori line $K = K_p$; otherwise the derivative
of $P_q(x; K, K)$ with respect to the inverse temperature $K$ should be positive
in a finite range of $x$ to lead to a continuous spectrum of $P_q(x; K, K)$ at a
point slightly below the Nishimori line, see Fig. 4.4. Clearly the same argument
applies to any step of the RSB because the right hand side of (4.57) would have
non-vanishing (divergent) values at some $x \ne \pm m(K)$ if a finite-step RSB
occurs just below the Nishimori line. Therefore we conclude that the Nishimori
line does not coincide with the AT line marking the onset of RSB if any. Note
that the present argument does not exclude the anomalous possibility of the RSB
just below the Nishimori line with infinitesimally slow emergence of the
non-trivial structure like $P_q(x) \propto f(x)\, e^{-1/(K - K_p)}$.

It is possible to develop the same argument for the Gaussian model.


4.6.4 Non-monotonicity of spin configurations 

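Let us next investigate how spin orientations are aligned with each other when
we neglect the reduction of spin magnitude by thermal fluctuations in the $\pm J$
model. We only look at the sign of correlation functions:

$$
\left[ \frac{\langle S_0 S_r \rangle_K}{|\langle S_0 S_r \rangle_K|} \right]
= \frac{1}{(2\cosh K_p)^{N_B}} \sum_{\tau} \exp(K_p \sum \tau_{ij})\,
\frac{\mathrm{Tr}_S\, S_0 S_r \exp(K \sum \tau_{ij} S_i S_j)}
{|\mathrm{Tr}_S\, S_0 S_r \exp(K \sum \tau_{ij} S_i S_j)|}. \tag{4.59}
$$

After gauge transformation, we find

$$
\begin{aligned}
\left[ \frac{\langle S_0 S_r \rangle_K}{|\langle S_0 S_r \rangle_K|} \right]
&= \frac{1}{2^N (2\cosh K_p)^{N_B}} \sum_{\tau} \mathrm{Tr}_\sigma \exp(K_p \sum \tau_{ij} \sigma_i \sigma_j)\,
\langle \sigma_0 \sigma_r \rangle_{K_p}
\frac{\langle S_0 S_r \rangle_K}{|\langle S_0 S_r \rangle_K|} \\
&\le \frac{1}{2^N (2\cosh K_p)^{N_B}} \sum_{\tau} \mathrm{Tr}_\sigma \exp(K_p \sum \tau_{ij} \sigma_i \sigma_j)\,
|\langle \sigma_0 \sigma_r \rangle_{K_p}|.
\end{aligned} \tag{4.60}
$$

We have taken the absolute value and replaced the sign of
$\langle S_0 S_r \rangle_K$ by its upper bound 1. The right hand side is
equivalent to

$$
\left[ \frac{\langle \sigma_0 \sigma_r \rangle_{K_p}}{|\langle \sigma_0 \sigma_r \rangle_{K_p}|} \right] \tag{4.61}
$$

because, by rewriting (4.61) using the gauge transformation, we have

$$
\begin{aligned}
\left[ \frac{\langle \sigma_0 \sigma_r \rangle_{K_p}}{|\langle \sigma_0 \sigma_r \rangle_{K_p}|} \right]
&= \frac{1}{2^N (2\cosh K_p)^{N_B}} \sum_{\tau}
\frac{\{ \mathrm{Tr}_\sigma\, \sigma_0 \sigma_r \exp(K_p \sum \tau_{ij} \sigma_i \sigma_j) \}^2}
{|\mathrm{Tr}_\sigma\, \sigma_0 \sigma_r \exp(K_p \sum \tau_{ij} \sigma_i \sigma_j)|} \\
&= \frac{1}{2^N (2\cosh K_p)^{N_B}} \sum_{\tau}
|\mathrm{Tr}_\sigma\, \sigma_0 \sigma_r \exp(K_p \sum \tau_{ij} \sigma_i \sigma_j)| \\
&= \frac{1}{2^N (2\cosh K_p)^{N_B}} \sum_{\tau}
\mathrm{Tr}_\sigma \exp(K_p \sum \tau_{ij} \sigma_i \sigma_j)\, |\langle \sigma_0 \sigma_r \rangle_{K_p}|.
\end{aligned} \tag{4.62}
$$

Thus the following relation has been proved:

$$
[\mathrm{sgn} \langle \sigma_0 \sigma_r \rangle_K] \le [\mathrm{sgn} \langle S_0 S_r \rangle_{K_p}]. \tag{4.63}
$$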

This inequality shows that the expectation value of the relative orientation of
two arbitrarily chosen spins is a maximum at $K = K_p$ as a function of $K$
with $p$ fixed. Spins become best aligned with each other on the Nishimori line
when the temperature is decreased from a high value at a fixed $p$, and then the
relative orientation decreases as the temperature is further lowered, implying
non-monotonic behaviour of spin alignment. Note that the correlation function
itself $[\langle S_0 S_r \rangle_K]$ is not expected to be a maximum on the
Nishimori line.

Fig. 4.5. Different bond configurations with the same distribution of
frustration. Bold lines denote antiferromagnetic interactions. Black dots indicate
the frustrated plaquettes ($f_c = -1$). One of the two configurations changes
to the other by the gauge transformation with $\sigma_i = -1$.


4.7 Entropy of frustration 


We further develop an argument for the $\pm J$ model that the phase boundary
below the Nishimori line is expected to be vertical like $C_2$ of Fig. 4.3
(Nishimori 1986b). Starting from the definition of the configurational average of
the free energy

$$
-\beta [F] = \sum_{\tau} \frac{\exp(K_p \sum \tau_{ij})}{(2\cosh K_p)^{N_B}}
\log \mathrm{Tr}_S \exp(K \sum \tau_{ij} S_i S_j), \tag{4.64}
$$

we can derive the following expression by gauge transformation under the
condition $K = K_p$:

$$
\begin{aligned}
-\beta [F] &= \frac{1}{2^N (2\cosh K)^{N_B}} \sum_{\tau}
\mathrm{Tr}_\sigma \exp(K \sum \tau_{ij} \sigma_i \sigma_j)
\log \mathrm{Tr}_S \exp(K \sum \tau_{ij} S_i S_j) \\
&= \frac{1}{2^N (2\cosh K)^{N_B}} \sum_{\tau} Z(K) \log Z(K).
\end{aligned} \tag{4.65}
$$

Let us recall here that $Z(K)$ in front of $\log Z(K)$ was obtained from the gauge
transformation of the distribution function $P(J_{ij})$ and the sum over gauge
variables. Since gauge transformation does not change the product of bonds
$f_c = \prod_c J_{ij}$ over an arbitrary closed loop $c$, this $f_c$ is a
gauge-invariant quantity, called frustration (see Fig. 4.5).⁴ The sum of all bond
configurations with the same distribution of frustration $\{f_c\}$ gives $Z(K_p)$
(up to a normalization factor). This $Z(K_p)$ is therefore identified with the
probability of the distribution of frustration. Then, (4.65) may be regarded as
the average of the logarithm of the probability of frustration distribution on
the Nishimori line, which is nothing but the entropy of frustration distribution.
We are therefore able to interpret the free energy on the Nishimori line as the
entropy of frustration distribution.

⁴ A more accurate statement is that the loop $c$ is said to be frustrated when
$f_c < 0$. One often talks about frustration of the smallest possible loop, a
plaquette (the basic square composed of four bonds in the case of the square
lattice, for example).

It should be noted here that the distribution of frustration is determined 
only by the bond configuration J and is independent of temperature. Also, it 
is expected that the free energy is singular at the point M in Fig. 4.3 where 
the Nishimori line crosses the boundary between the ferromagnetic and non- 
ferromagnetic phases, leading to a singularity in the frustration distribution. 
These observations indicate that the singularity in the free energy at M is caused 
by a sudden change of the frustration distribution, which is of geometrical nature. 
In other words, the frustration distribution is singular at the same $p$ ($= p_c$) as
the point M as one changes $p$ with temperature fixed. This singularity should be
reflected in singularities in physical quantities at $p = p_c$. Our conclusion is that
there is a vertical phase boundary at the same p as M. It should be remembered 
that singularities at higher temperatures than the point M are actually erased by 
large thermal fluctuations. This argument is not a rigorous proof for a vertical 
boundary, but existing numerical results are compatible with this conclusion (see 
the bibliographical note at the end of the chapter). 

Singularities in the distribution of frustration are purely of a geometrical 
nature independent of spin variables. It is therefore expected that the location 
of a vertical boundary is universal, shared by, for instance, the XY model on the 
same lattice if the distribution of $J_{ij}$ is the same (Nishimori 1992).


4.8 Modified ±J model


The existence of a vertical phase boundary discussed in §4.7 can be confirmed
also from the following argument (Kitatani 1992). The probability distribution
function of interactions of the ±J model is given as in (4.4) for each bond. It
is instructive to modify this distribution and introduce the modified ±J model
with the following distribution:


$$
P_M(K_p, a, \tau) = \frac{\exp\{(K_p + a)\sum_{\langle ij\rangle}\tau_{ij}\}}{(2\cosh K_p)^{N_B}} \, \frac{Z(K_p, \tau)}{Z(K_p + a, \tau)}, \tag{4.66}
$$

where $a$ is a real parameter. Equation (4.66) reduces to the usual ±J model
when $a = 0$. It is straightforward to show that (4.66) satisfies the normalization
condition by summing it over $\tau$ and then using gauge transformation.
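The normalization of (4.66) and the $a$-independence of gauge-invariant averages can be checked by brute force on a very small system. The following is only a sketch under stated assumptions: the four-site ring, the function names, and the parameter values are illustrative choices and do not come from the text.

```python
import itertools
import math

# Toy system (an assumption for illustration): N = 4 spins on a ring, N_B = 4 bonds.
N = 4
BONDS = [(0, 1), (1, 2), (2, 3), (3, 0)]


def partition_function(K, tau):
    """Z(K, tau) = Tr_S exp(K * sum_<ij> tau_ij S_i S_j), by direct enumeration."""
    total = 0.0
    for spins in itertools.product((1, -1), repeat=N):
        s = sum(t * spins[i] * spins[j] for t, (i, j) in zip(tau, BONDS))
        total += math.exp(K * s)
    return total


def modified_weight(Kp, a, tau):
    """P_M(K_p, a, tau) as written in eq. (4.66)."""
    boltz = math.exp((Kp + a) * sum(tau)) / (2.0 * math.cosh(Kp)) ** len(BONDS)
    return boltz * partition_function(Kp, tau) / partition_function(Kp + a, tau)


def all_bond_configs():
    """All 2^{N_B} assignments of tau_ij = +1 or -1."""
    return itertools.product((1, -1), repeat=len(BONDS))


def total_weight(Kp, a):
    """Sum of P_M over all bond configurations; expected to equal 1."""
    return sum(modified_weight(Kp, a, tau) for tau in all_bond_configs())


def mean_loop_product(Kp, a):
    """Average of the gauge-invariant loop product f_c = prod tau_ij over P_M."""
    return sum(modified_weight(Kp, a, tau) * math.prod(tau)
               for tau in all_bond_configs())


if __name__ == "__main__":
    for a in (-0.5, 0.0, 0.8):
        print(a, total_weight(0.7, a), mean_loop_product(0.7, a))
```

For this ring the loop product is the only frustration variable, and its average should not depend on $a$; at $a = 0$ it reduces to $\tanh^4 K_p$ because the bonds are then independent.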


4.8.1 Expectation value of physical quantities 

The expectation value of a gauge-invariant quantity in the modified ±J model
coincides with that of the conventional ±J model. We denote by $\{\cdots\}^a_{K_p}$ the
configurational average by the probability (4.66) and by $[\cdots]_{K_p}$ the
configurational average in the conventional ±J model. To prove the coincidence, we first
write the definition of the configurational average of a gauge-invariant quantity
$Q$ in the modified ±J model and apply gauge transformation to it. Then we sum
it over the gauge variables $\sigma$ to find that the factors $Z(K_p + a, \tau)$ appearing
in both the numerator and denominator cancel to give


$$
\{Q\}^a_{K_p} = \frac{1}{2^N (2\cosh K_p)^{N_B}} \sum_{\tau} Z(K_p, \tau)\, Q = [Q]_{K_p}. \tag{4.67}
$$

The final equality can be derived by applying gauge transformation to the
definition of $[Q]_{K_p}$ and summing the result over gauge variables. Equation (4.67)
shows that the configurational average of a gauge-invariant quantity is independent
of $a$.
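The steps leading to (4.67) can be written out explicitly. This is a sketch in the notation of (4.66), using only the gauge invariance of $Q$ and of the partition functions; the intermediate lines are a reconstruction and are not quoted from the text.

```latex
\begin{align*}
\{Q\}^a_{K_p}
  &= \sum_{\tau} P_M(K_p, a, \tau)\, Q(\tau) \\
  &= \frac{1}{(2\cosh K_p)^{N_B}} \sum_{\tau}
     \exp\Bigl\{(K_p + a) \sum_{\langle ij\rangle} \tau_{ij}\sigma_i\sigma_j\Bigr\}
     \frac{Z(K_p, \tau)}{Z(K_p + a, \tau)}\, Q(\tau)
     && \text{(gauge transformation)} \\
  &= \frac{1}{2^N (2\cosh K_p)^{N_B}} \sum_{\tau}
     Z(K_p + a, \tau)\, \frac{Z(K_p, \tau)}{Z(K_p + a, \tau)}\, Q(\tau)
     && \text{(sum over } \sigma \text{, divide by } 2^N\text{)} \\
  &= \frac{1}{2^N (2\cosh K_p)^{N_B}} \sum_{\tau} Z(K_p, \tau)\, Q(\tau).
\end{align*}
```

Setting $Q = 1$ in the last line and carrying out the sum over $\tau$ inside $Z(K_p, \tau)$ gives $2^N (2\cosh K_p)^{N_B}$, which is the normalization condition of (4.66).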

Let us next derive a few relations for correlation functions by the same
method as in the previous sections. If we take the limit $r \to \infty$ in (4.45) (which
holds for the conventional ±J model), the left hand side reduces to the squared
magnetization $m(K, K_p)^2$. Similarly, when $K = K_p$, the right hand side
approaches the square of the usual spin glass order parameter, $q(K_p, K_p)^2$. We thus
have

$$
m(K_p, K_p) = q(K_p, K_p). \tag{4.68}
$$


The corresponding relation for the modified ±J model is

$$
m_M(K_p + a, K_p) = q_M(K_p + a, K_p), \tag{4.69}
$$

where the subscript M denotes that the quantities are for the modified ±J model.
Another useful relation is, for general $K$,

$$
q(K, K_p) = q_M(K, K_p), \tag{4.70}
$$

which is valid according to (4.67) because the spin glass order parameter $q$ is
gauge invariant. It is also not difficult to prove that

$$
m(K_p + a, K_p) = m_M(K_p, K_p) \tag{4.71}
$$

from gauge transformation.


4.8.2 Phase diagram 

Various formulae derived in the previous subsection are useful to show the close
relationship between the phase diagrams of the modified and conventional ±J
models. First of all, we note that the spin glass phase exists in the same region
in both models according to (4.70). In the conventional ±J model, (4.68) implies
that $q > 0$ if $m > 0$ on the Nishimori line $K = K_p$. Thus there does not exist
a spin glass phase ($q > 0$, $m = 0$) when $K = K_p$, and the ordered phase at low
temperatures on the Nishimori line (the part with $p > p_c$ in Fig. 4.6(a)) should
be the ferromagnetic phase. Accordingly, the ordered phase to the upper right
side of the Nishimori line cannot be the spin glass phase but is the ferromagnetic
phase.




Fig. 4.6. Phase diagram of the conventional (a) and modified (b) ±J models


In the modified ±J model, on the other hand, (4.69) holds when $K = K_p + a$,
so that the lower part of the curve $K = K_p + a$ is in the ferromagnetic phase
(the part with $p > p_c$ in Fig. 4.6(b)). It is very plausible that the ordered phase
to the upper right of the line $K = K_p + a$ is the ferromagnetic phase, similar to
the case of the conventional ±J model.

We next notice that the magnetization at the point B on the line $K = K_p + a$
in the conventional ±J model (Fig. 4.6(a)) is equal to that at C on $K = K_p$ in the
modified ±J model (Fig. 4.6(b)) according to (4.71). Using the above argument
that the ferromagnetic phase exists to the upper right of $K = K_p + a$ in the
modified ±J model, we see that $m_M(K_p, K_p) > 0$ at C and thus $m(K_p + a, K_p) > 0$
at B. If we vary $a$ ($> 0$) with $p$ fixed, B moves along the vertical line below
$K = K_p$. Therefore, if $m(K_p, K_p) > 0$ at a point on the line $K = K_p$, we are
sure to have $m(K, K_p) > 0$ at all points below it. It is therefore concluded that
$m > 0$ below the line $K = K_p$ for all $p$ in the range $p > p_c$. We have already
proved in §4.6 that there is no ferromagnetic phase in the range $p < p_c$, and
hence a vertical boundary at $p = p_c$ is concluded to exist in the conventional ±J
model.

We have assumed in the above argument that the modified ±J model has the
ferromagnetic phase on the line $K = K_p$, which has not been proved rigorously
to be true. However, this assumption is a very plausible one and it is quite
reasonable to expect that the existence of a vertical boundary is valid generally.


4.8.3 Existence of spin glass phase 


Active investigations are still being carried out concerning the existence of a spin
glass phase as an equilibrium state in the Edwards-Anderson model (including
the conventional ±J and Gaussian models) in finite dimensions. Numerical methods
are used mainly, and it is presently believed that the Edwards-Anderson
model with Ising spins has a spin glass phase in three and higher dimensions. In
the modified ±J model, it is possible to prove the existence of a spin glass phase
relatively easily. Let us suppose $a < 0$ in the present section.

Fig. 4.7. Phase diagram of the modified ±J model with $a < 0$

As pointed out previously, the conventional ±J and modified ±J models
share the same region with $q > 0$ (spin glass or ferromagnetic phase) in the
phase diagram. In the modified ±J model, we have $m_M = q_M > 0$ in the
low-temperature part of the line $K = K_p + a$ and hence this part lies in the
ferromagnetic phase (Fig. 4.7). It is also possible to prove the following inequality
similarly to (4.50):


$$
\bigl|\{\langle S_0 S_r\rangle_{K_p}\}^a_{K_p}\bigr| \le \bigl\{|\langle S_0 S_r\rangle_{K_p + a}|\bigr\}^a_{K_p}. \tag{4.72}
$$


If we set $r \to \infty$ in this relation, the left hand side reduces to the squared
magnetization on the line $K = K_p$ in the modified ±J model and the right hand
side to an order parameter on $K = K_p + a$. Thus the right hand side approaches
zero in the paramagnetic phase (where $p < p_m$ in Fig. 4.7) and consequently
the left hand side vanishes as well. It then follows that the shaded region in the
range $p_c < p < p_m$ in Fig. 4.7 has $q_M > 0$, $m_M = 0$, that is, the spin glass phase.

The only assumption in the above argument is the existence of the
ferromagnetic phase in the conventional ±J model at low temperature, which has
already been proved in two dimensions (Horiguchi and Morita 1982b), and it is
straightforward to apply the same argument to higher dimensions. Hence it has
been proved rigorously that the modified ±J model has a spin glass phase in
two and higher dimensions.

We note that the bond variables $\tau$ are not distributed independently at each
bond $\langle ij\rangle$ in the modified ±J model in contrast to the conventional ±J model.
However, the physical properties of the modified ±J model should not be
essentially different from those of the conventional ±J model since gauge-invariant
quantities (the spin glass order parameter, free energy, specific heat, and so on)
assume the same values. When $a > 0$, the distribution (4.66) gives larger
probabilities to ferromagnetic configurations with $\tau_{ij} > 0$ than in the conventional
±J model, and the ferromagnetic phase tends to be enhanced. The case $a < 0$
has the opposite tendency, which may be the reason for the existence of the spin
glass phase.
It is nevertheless remarkable that a relatively mild modification of the ±J
model leads to a model for which a spin glass phase is proved to exist.


4.9 Gauge glass 

The gauge theory applies not just to the Ising models but to many other models 
(Nishimori 1981; Nishimori and Stephen 1983; Georges et al. 1985; Ozeki and 
Nishimori 1993; Nishimori 1994). We explain the idea using the example of the 
XY model with the Hamiltonian (gauge glass) 


$$
H = -J \sum_{\langle ij\rangle} \cos(\theta_i - \theta_j - \chi_{ij}), \tag{4.73}
$$


where quenched randomness exists in the phase variable $\chi_{ij}$. The case $\chi_{ij} = 0$
is the usual ferromagnetic XY model. The gauge theory works in this model if
the randomly quenched phase variable follows the distribution

$$
P(\chi_{ij}) = \frac{1}{2\pi I_0(K_p)} \exp(K_p \cos\chi_{ij}), \tag{4.74}
$$

where $I_0(K_p)$ is the modified Bessel function for normalization. The gauge
transformation in this case is

$$
\theta_i \to \theta_i - \phi_i, \qquad \chi_{ij} \to \chi_{ij} - \phi_i + \phi_j. \tag{4.75}
$$


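Here $\phi_i$ denotes the gauge variable arbitrarily fixed to a real value at each $i$. The
Hamiltonian is gauge invariant. The distribution function (4.74) transforms as

$$
P(\chi_{ij}) \to \frac{1}{2\pi I_0(K_p)} \exp\{K_p \cos(\phi_i - \phi_j - \chi_{ij})\}. \tag{4.76}
$$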

4.9.1 Energy, specific heat, and correlation 
To evaluate the internal energy, we first write its definition 


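$$
[E] = \frac{N_B}{(2\pi I_0(K_p))^{N_B}} \int_0^{2\pi} \prod_{\langle ij\rangle} d\chi_{ij}\, \exp\Bigl(K_p \sum_{\langle ij\rangle} \cos\chi_{ij}\Bigr)
\cdot \frac{\int_0^{2\pi} \prod_i d\theta_i\, \{-J\cos(\theta_i - \theta_j - \chi_{ij})\} \exp\{K \sum_{\langle ij\rangle} \cos(\theta_i - \theta_j - \chi_{ij})\}}{\int_0^{2\pi} \prod_i d\theta_i\, \exp\{K \sum_{\langle ij\rangle} \cos(\theta_i - \theta_j - \chi_{ij})\}}. \tag{4.77}
$$

The integration range shifts by $\phi_i$ after gauge transformation, which does not
affect the final value because the integrand is a periodic function with period $2\pi$.
The value of the above expression therefore does not change by gauge
transformation:

$$
[E] = -\frac{N_B J}{(2\pi I_0(K_p))^{N_B}} \int_0^{2\pi} \prod_{\langle ij\rangle} d\chi_{ij}\, \exp\Bigl\{K_p \sum_{\langle ij\rangle} \cos(\phi_i - \phi_j - \chi_{ij})\Bigr\}
\cdot \frac{\int_0^{2\pi} \prod_i d\theta_i\, \cos(\theta_i - \theta_j - \chi_{ij}) \exp\{K \sum_{\langle ij\rangle} \cos(\theta_i - \theta_j - \chi_{ij})\}}{\int_0^{2\pi} \prod_i d\theta_i\, \exp\{K \sum_{\langle ij\rangle} \cos(\theta_i - \theta_j - \chi_{ij})\}}. \tag{4.78}
$$

Both sides of this formula do not depend on $\{\phi_i\}$, and consequently we may
integrate the right hand side over $\{\phi_i\}$ from 0 to $2\pi$ and divide the result by
$(2\pi)^N$ to get the same value:

$$
[E] = -\frac{N_B J}{(2\pi)^N (2\pi I_0(K_p))^{N_B}} \int_0^{2\pi} \prod_{\langle ij\rangle} d\chi_{ij} \int_0^{2\pi} \prod_i d\phi_i\, \exp\Bigl\{K_p \sum_{\langle ij\rangle} \cos(\phi_i - \phi_j - \chi_{ij})\Bigr\}
\cdot \frac{\int_0^{2\pi} \prod_i d\theta_i\, \cos(\theta_i - \theta_j - \chi_{ij}) \exp\{K \sum_{\langle ij\rangle} \cos(\theta_i - \theta_j - \chi_{ij})\}}{\int_0^{2\pi} \prod_i d\theta_i\, \exp\{K \sum_{\langle ij\rangle} \cos(\theta_i - \theta_j - \chi_{ij})\}}. \tag{4.79}
$$

If $K = K_p$, the denominator and numerator cancel and we find

$$
[E] = -\frac{J}{(2\pi)^N (2\pi I_0(K))^{N_B}} \int_0^{2\pi} \prod_{\langle ij\rangle} d\chi_{ij}\, \frac{\partial}{\partial K} \int_0^{2\pi} \prod_i d\theta_i\, \exp\Bigl\{K \sum_{\langle ij\rangle} \cos(\theta_i - \theta_j - \chi_{ij})\Bigr\}. \tag{4.80}
$$

Integration over $\chi_{ij}$ gives $2\pi I_0(K)$ for each $\langle ij\rangle$:

$$
[E] = -\frac{J}{(2\pi)^N (2\pi I_0(K))^{N_B}} (2\pi)^N \frac{\partial}{\partial K} (2\pi I_0(K))^{N_B} = -J N_B \frac{I_1(K)}{I_0(K)}. \tag{4.81}
$$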


Because the modified Bessel functions $I_0(K)$ and $I_1(K)$ are not singular and
$I_0(K) > 0$ for positive $K$, we conclude, as in the case of the Ising model, that
the internal energy has no singularity along the Nishimori line $K = K_p$ although
it crosses the boundary between the ferromagnetic and paramagnetic phases.

It is straightforward to evaluate an upper bound on the specific heat on the 
Nishimori line, similar to the Ising case, and the result is 


$$
T^2 [C] \le J^2 N_B \left\{ \frac{1}{2} + \frac{I_2(K)}{2 I_0(K)} - \left( \frac{I_1(K)}{I_0(K)} \right)^2 \right\}. \tag{4.82}
$$


The right hand side remains finite anywhere on the Nishimori line. 
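The closed forms (4.81) and (4.82) are easy to evaluate numerically. The sketch below computes the modified Bessel functions from their integral representation with a midpoint rule, so that no external library is needed; the function names and the choice of quadrature are assumptions made for illustration only.

```python
import math


def bessel_i(n, K, steps=2000):
    """I_n(K) = (1/pi) * integral_0^pi exp(K cos t) cos(n t) dt (midpoint rule)."""
    h = math.pi / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += math.exp(K * math.cos(t)) * math.cos(n * t)
    return total * h / math.pi


def energy_per_bond(K, J=1.0):
    """[E] / N_B on the Nishimori line K = K_p, following eq. (4.81)."""
    return -J * bessel_i(1, K) / bessel_i(0, K)


def specific_heat_bound_per_bond(K, J=1.0):
    """Right hand side of eq. (4.82) divided by N_B."""
    i0, i1, i2 = bessel_i(0, K), bessel_i(1, K), bessel_i(2, K)
    return J ** 2 * (0.5 + i2 / (2.0 * i0) - (i1 / i0) ** 2)


if __name__ == "__main__":
    for K in (0.1, 0.5, 1.0, 2.0, 5.0):
        print(K, energy_per_bond(K), specific_heat_bound_per_bond(K))
```

Both quantities vary smoothly with $K$, in line with the statement that nothing singular happens along the Nishimori line.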

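Arguments on the correlation equality and inequality work as well. The
correlation function is $[\langle\cos(\theta_i - \theta_j)\rangle_K]$, or $[\langle\exp i(\theta_i - \theta_j)\rangle_K]$, and by using the latter
expression and the gauge theory, we can derive the following identity:

$$
[\langle\cos(\theta_i - \theta_j)\rangle_K] = [\langle\cos(\phi_i - \phi_j)\rangle_{K_p} \langle\cos(\theta_i - \theta_j)\rangle_K]. \tag{4.83}
$$

By taking the absolute value and evaluating the upper bound, we have

$$
\bigl|[\langle\cos(\theta_i - \theta_j)\rangle_K]\bigr| \le \bigl[|\langle\cos(\phi_i - \phi_j)\rangle_{K_p}|\bigr]. \tag{4.84}
$$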


This relation can be interpreted in the same way as in the Ising model: the 
ferromagnetic phase is not allowed to lie below the spin glass phase. 
Fig. 4.8. Ground-state configuration of the six-site XY model with a single
antiferromagnetic bond (bold line). The right plaquette has chirality + and
the left −.


4.9.2 Chirality 
The XY model has an effective degree of freedom called chirality. We can show 
that the chirality completely loses spatial correlations when K = Kp. 

This result is a consequence of the following relation

$$
[\langle f_1(\theta_i - \theta_j - \chi_{ij})\, f_2(\theta_l - \theta_m - \chi_{lm})\rangle] = [\langle f_1(\theta_i - \theta_j - \chi_{ij})\rangle]\,[\langle f_2(\theta_l - \theta_m - \chi_{lm})\rangle] \tag{4.85}
$$

for $K = K_p$, where $\langle ij\rangle$ and $\langle lm\rangle$ are distinct bonds and $f_1$ and $f_2$ are arbitrary
functions with period $2\pi$. Equation (4.85) can be proved in the same way as
we derived the exact energy (4.81) and is analogous to the decoupling of bond
energy in the Ising model (4.21). Each factor on the right hand side of (4.85) is
evaluated as in the previous subsection and the result is


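$$
[\langle f_1(\theta_i - \theta_j - \chi_{ij})\, f_2(\theta_l - \theta_m - \chi_{lm})\rangle] = \frac{\int_0^{2\pi} d\theta\, f_1(\theta)\, e^{K\cos\theta}}{\int_0^{2\pi} d\theta\, e^{K\cos\theta}} \cdot \frac{\int_0^{2\pi} d\theta\, f_2(\theta)\, e^{K\cos\theta}}{\int_0^{2\pi} d\theta\, e^{K\cos\theta}} \tag{4.86}
$$

under the condition $K = K_p$.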
Chirality has been introduced to quantify the degree of twistedness in the
frustrated XY model (Villain 1977) and is defined by

$$
\kappa_p = \sum \sin(\theta_i - \theta_j - \chi_{ij}), \tag{4.87}
$$


where the sum is over a directed path (counter-clockwise, for example) around
a plaquette. Frustrated plaquettes are generally neighbouring to each other as
in Fig. 4.8, and such plaquettes carry the opposite signs of chirality. It is a
direct consequence of the general relation (4.86) that chiralities at plaquettes
without common bonds are independent. In particular, we have $[\langle\kappa_p\rangle] = 0$ at
any temperature since the sine in (4.87) is an odd function and consequently

$$
[\langle\kappa_{p1}\kappa_{p2}\rangle] = [\langle\kappa_{p1}\rangle][\langle\kappa_{p2}\rangle] = 0 \tag{4.88}
$$

if $K = K_p$ and the plaquettes $p1$ and $p2$ are not adjacent to each other sharing
a bond.
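Each factor in (4.86) is a one-dimensional average with weight $e^{K\cos\theta}$, which can be evaluated by simple quadrature. The following sketch (the helper name `bond_average` and the midpoint rule are illustrative assumptions) shows that the sine average vanishes, which is the origin of $[\langle\kappa_p\rangle] = 0$, while the cosine average reproduces the ratio $I_1(K)/I_0(K)$ of (4.81).

```python
import math


def bond_average(f, K, steps=4000):
    """One factor of eq. (4.86): the average of f(theta) with weight exp(K cos theta)
    over one period [0, 2*pi), evaluated with the midpoint rule."""
    h = 2.0 * math.pi / steps
    numerator = 0.0
    denominator = 0.0
    for k in range(steps):
        theta = (k + 0.5) * h
        weight = math.exp(K * math.cos(theta))
        numerator += f(theta) * weight
        denominator += weight
    return numerator / denominator


if __name__ == "__main__":
    K = 1.0
    print("sine average  :", bond_average(math.sin, K))   # odd integrand, vanishes
    print("cosine average:", bond_average(math.cos, K))   # equals I_1(K) / I_0(K)
```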
The complete absence of chirality correlation (4.88) is not easy to understand
intuitively. Chiralities are in an ordered state at low temperatures in regular 
frustrated systems (Miyashita and Shiba 1984) and in a spin glass state in the 
random XY model (Kawamura and Li 1996). There is no apparent reason to 
expect the absence of chirality correlations in the present gauge glass problem. 
Further investigation is required to clarify this point. 


4.9.3 XY spin glass


The name ‘XY spin glass’ is usually used for the XY model with random
interactions:

$$
H = -\sum_{\langle ij\rangle} J_{ij} \cos(\theta_i - \theta_j), \tag{4.89}
$$

where $J_{ij}$ obeys such a distribution as the ±J or Gaussian function as in the
Ising spin glass. This model is clearly different from the gauge glass (4.73). It
is difficult to analyse this XY spin glass by gauge transformation because the 
gauge summation of the probability weight leads to a partition function of the 
Ising spin glass, which does not cancel out the partition function of the XY spin 
glass appearing in the denominator of physical quantities such as the energy. It 
is nevertheless possible to derive an interesting relation between the Ising and 
XY spin glasses using a correlation function (Nishimori 1992). 

The gauge transformation of the Ising type (4.2) reveals the following relation
of correlation functions:

$$
[\langle \boldsymbol{S}_i \cdot \boldsymbol{S}_j\rangle^{XY}_K] = [\langle S_i S_j\rangle^{I}_{K_p} \langle \boldsymbol{S}_i \cdot \boldsymbol{S}_j\rangle^{XY}_K], \tag{4.90}
$$

where $\boldsymbol{S}_i$ ($= {}^t(\cos\theta_i, \sin\theta_i)$) denotes the XY spin and the thermal average is
taken with the ±J XY Hamiltonian (4.89) for $\langle\cdots\rangle^{XY}$ and with the ±J Ising
model for $\langle\cdots\rangle^{I}$. Taking the absolute value of both sides of (4.90) and replacing
$\langle \boldsymbol{S}_i \cdot \boldsymbol{S}_j\rangle^{XY}_K$ on the right hand side by its upper bound 1, we find

$$
\bigl|[\langle \boldsymbol{S}_i \cdot \boldsymbol{S}_j\rangle^{XY}_K]\bigr| \le \bigl[|\langle S_i S_j\rangle^{I}_{K_p}|\bigr]. \tag{4.91}
$$

The right hand side vanishes for $p < p_c$ (see Fig. 4.3) in the limit $|i - j| \to \infty$
since the Nishimori line is in the paramagnetic phase. Thus the left hand side
also vanishes in the same range of $p$. This proves that $p_c$, the lower limit of the
ferromagnetic phase, for the XY model is not lower than that for the Ising model:

$$
p_c^{XY} \ge p_c^{I}. \tag{4.92}
$$

It is actually expected from the argument of §4.7 that the equality holds in
(4.92).

The same argument can be developed for the Gaussian case as well as other
models like the ±J/Gaussian Heisenberg spin glass whose spin variable has three
components $\boldsymbol{S} = {}^t(S_x, S_y, S_z)$ under the constraint $\boldsymbol{S}^2 = 1$.
4.10 Dynamical correlation function

It is possible to apply the gauge theory also to non-equilibrium situations (Ozeki
1995, 1997). We start our argument from the master equation that determines
the time development of the probability $P_t(S)$ that a spin configuration $S$ is
realized at time $t$:

$$
\frac{dP_t(S)}{dt} = \mathrm{Tr}_{S'}\, W(S|S')\, P_t(S'). \tag{4.93}
$$

Here $W(S|S')$ is the transition probability that the state changes from $S'$ to
$S$ ($\ne S'$) in unit time. Equation (4.93) means that the probability $P_t(S)$ increases
by the amount by which the state changes into $S$. If $W < 0$, the probability
decreases by the change of the state from $S$.

An example is the kinetic Ising model

$$
W(S|S') = \delta_1(S|S')\, \frac{\exp(-\frac{\beta}{2}\Delta(S, S'))}{2\cosh\frac{\beta}{2}\Delta(S, S')}
- \delta(S, S')\, \mathrm{Tr}_{S''}\, \delta_1(S''|S)\, \frac{\exp(-\frac{\beta}{2}\Delta(S'', S))}{2\cosh\frac{\beta}{2}\Delta(S'', S)}, \tag{4.94}
$$

where $\delta_1(S|S')$ is a function equal to one when the difference between $S$ and $S'$
is just a single spin flip and is zero otherwise:

$$
\delta_1(S|S') = \delta\Bigl\{2, \sum_i (1 - S_i S_i')\Bigr\}. \tag{4.95}
$$


In (4.94), $\Delta(S, S')$ represents the energy change $H(S) - H(S')$. The first term
on the right hand side of (4.94) is the contribution from the process where the
state of the system changes from $S'$ to $S$ by a single spin flip with probability
$e^{-\beta\Delta/2}/2\cosh(\beta\Delta/2)$. The second term is for the process where the state changes
from $S$ to another by a single spin flip. Since $\delta(S, S')$, $\delta_1(S|S')$, and $\Delta(S, S')$ are
all gauge invariant, $W$ is also gauge invariant: $W(S|S') = W(S\sigma|S'\sigma)$, where
$S\sigma = \{S_i\sigma_i\}$.
The formal solution of the master equation (4.93) is

$$
P_t(S) = \mathrm{Tr}_{S'}\, \langle S|e^{tW}|S'\rangle\, P_0(S'). \tag{4.96}
$$
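The structure of the transition matrix (4.94) can be made concrete on a tiny system. The sketch below builds $W$ for three Ising spins with an arbitrarily chosen bond configuration and checks two properties implied by the master equation: each column of $W$ sums to zero (probability is conserved), and the Boltzmann weight is stationary. The system size, the bond values, and all names are assumptions made for illustration.

```python
import itertools
import math

# Toy parameters (assumptions): three spins, arbitrary quenched bonds tau_ij.
N = 3
BONDS = {(0, 1): 1, (1, 2): -1, (0, 2): 1}
BETA, J = 0.8, 1.0
STATES = list(itertools.product((1, -1), repeat=N))


def energy(s):
    """H(S) = -J * sum_<ij> tau_ij S_i S_j."""
    return -J * sum(t * s[i] * s[j] for (i, j), t in BONDS.items())


def is_single_flip(s, sp):
    """delta_1(S|S') of eq. (4.95): true when exactly one spin differs."""
    return sum(1 - a * b for a, b in zip(s, sp)) == 2


def rate(s, sp):
    """Off-diagonal element W(S|S') of eq. (4.94)."""
    if not is_single_flip(s, sp):
        return 0.0
    x = 0.5 * BETA * (energy(s) - energy(sp))
    return math.exp(-x) / (2.0 * math.cosh(x))


def transition_matrix():
    """Full matrix W as a dict keyed by (S, S'); the diagonal is minus the escape rate."""
    W = {}
    for sp in STATES:
        for s in STATES:
            if s != sp:
                W[(s, sp)] = rate(s, sp)
        W[(sp, sp)] = -sum(rate(s2, sp) for s2 in STATES if s2 != sp)
    return W


if __name__ == "__main__":
    W = transition_matrix()
    print(max(abs(sum(W[(s, sp)] for s in STATES)) for sp in STATES))
```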


We prove the following relation between the dynamical correlation function and
non-equilibrium magnetization using the above formal solution:

$$
[\langle S_i(0) S_i(t)\rangle^{K_p}_K] = [\langle S_i(t)\rangle^{F}_K]. \tag{4.97}
$$

Here $\langle S_i(0) S_i(t)\rangle^{K_p}_K$ is the autocorrelation function, the expectation value of the
spin product at site $i$ when the system was in equilibrium at $t = 0$ with the
inverse temperature $K_p$ and was developed over time to $t$:

$$
\langle S_i(0) S_i(t)\rangle^{K_p}_K = \mathrm{Tr}_S \mathrm{Tr}_{S'}\, S_i \langle S|e^{tW}|S'\rangle S_i'\, P_e(S', K_p) \tag{4.98}
$$

$$
P_e(S', K_p) = \frac{1}{Z(K_p)} \exp\Bigl(K_p \sum_{\langle ij\rangle} \tau_{ij} S_i' S_j'\Bigr), \tag{4.99}
$$

where the $K$ ($= \beta J$)-dependence of the right hand side is in $W$. The expectation
value $\langle S_i(t)\rangle^{F}_K$ is the site magnetization at time $t$ starting from the perfect
ferromagnetic state $|F\rangle$:

$$
\langle S_i(t)\rangle^{F}_K = \mathrm{Tr}_S\, S_i \langle S|e^{tW}|F\rangle. \tag{4.100}
$$


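To prove (4.97), we perform gauge transformation ($\tau_{ij} \to \tau_{ij}\sigma_i\sigma_j$, $S_i \to S_i\sigma_i$)
to the configurational average of (4.100)

$$
[\langle S_i(t)\rangle^{F}_K] = \frac{1}{(2\cosh K_p)^{N_B}} \sum_{\tau} \exp\Bigl(K_p \sum_{\langle ij\rangle} \tau_{ij}\Bigr)\, \mathrm{Tr}_S\, S_i \langle S|e^{tW}|F\rangle. \tag{4.101}
$$

Then the term $\langle S|e^{tW}|F\rangle$ is transformed to $\langle S\sigma|e^{tW}|F\rangle$, which is equal to $\langle S|e^{tW}|\sigma\rangle$
by the gauge invariance of $W$. Hence we have

$$
[\langle S_i(t)\rangle^{F}_K] = \frac{1}{(2\cosh K_p)^{N_B}} \frac{1}{2^N} \sum_{\tau} \mathrm{Tr}_\sigma \exp\Bigl(K_p \sum_{\langle ij\rangle} \tau_{ij}\sigma_i\sigma_j\Bigr)\, \mathrm{Tr}_S\, S_i \langle S|e^{tW}|\sigma\rangle \sigma_i
$$
$$
= \frac{1}{(2\cosh K_p)^{N_B}} \frac{1}{2^N} \sum_{\tau} \mathrm{Tr}_S \mathrm{Tr}_\sigma\, S_i \langle S|e^{tW}|\sigma\rangle \sigma_i\, Z(K_p)\, P_e(\sigma, K_p). \tag{4.102}
$$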


It is also possible to show that the configurational average of (4.98) is equal to 
the above expression, from which (4.97) follows immediately. 

Equation (4.97) shows that the non-equilibrium relaxation of magnetization 
starting from the perfect ferromagnetic state is equal to the configurational av- 
erage of the autocorrelation function under the initial condition of equilibrium 
state at the inverse temperature Kp. In particular, when K = Kp, the left hand 
side is the equilibrium autocorrelation function, and this identity gives a direct 
relation between equilibrium and non-equilibrium quantities. 

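A generalization of (4.97) also holds:

$$
[\langle S_i(t_w) S_i(t + t_w)\rangle^{K_p}_K] = [\langle S_i(t_w) S_i(t + t_w)\rangle^{F}_K]. \tag{4.103}
$$

Both sides are defined as follows:

$$
\langle S_i(t_w) S_i(t + t_w)\rangle^{K_p}_K = \mathrm{Tr}_{S_0} \mathrm{Tr}_{S_1} \mathrm{Tr}_{S_2}\, S_{2i} \langle S_2|e^{tW}|S_1\rangle S_{1i} \langle S_1|e^{t_w W}|S_0\rangle P_e(S_0, K_p) \tag{4.104}
$$

$$
\langle S_i(t_w) S_i(t + t_w)\rangle^{F}_K = \mathrm{Tr}_{S_1} \mathrm{Tr}_{S_2}\, S_{2i} \langle S_2|e^{tW}|S_1\rangle S_{1i} \langle S_1|e^{t_w W}|F\rangle. \tag{4.105}
$$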


Equation (4.104) is the autocorrelation function at time $t$ after waiting for $t_w$
with inverse temperature $K$ starting at time $t = 0$ from the initial equilibrium
state with the inverse temperature $K_p$. Equation (4.105), on the other hand,
represents the correlation function with the perfect ferromagnetic state as the
initial condition instead of the equilibrium state at $K_p$ in (4.104). Equation (4.97)
is a special case of (4.103) with $t_w = 0$.

To prove (4.103), we first note that the gauge transformation in (4.105) yields
$\sigma$ as the initial condition in this equation. Similar to the case of (4.102), we take
a sum over $\sigma$ and divide the result by $2^N$. We next perform gauge transformation
and sum it up over the gauge variables in (4.104), comparison of which with the
above result leads to (4.103). It is to be noted in this calculation that (4.105) is
gauge invariant.

If $K = K_p$ in (4.103),

$$
[\langle S_i(0) S_i(t)\rangle^{K_p}_{K_p}] = [\langle S_i(t_w) S_i(t + t_w)\rangle^{F}_{K_p}]. \tag{4.106}
$$

For $K = K_p$, the left hand side of (4.103) is the equilibrium autocorrelation
function and should not depend on $t_w$, and thus we have set $t_w = 0$. Equation (4.106)
proves that the autocorrelation function does not depend on $t_w$ on average if it is
measured after the system is kept at the inverse temperature $K_p$ for time duration
$t_w$ with a perfect ferromagnetic state as the initial condition. The aging
phenomenon, in which non-equilibrium quantities depend on
the waiting time $t_w$ before measurement, is considered to be an important
characteristic of the spin glass phase (Young 1997; Miyako et al. 2000). Equation
(4.106) indicates that the configurational average of the autocorrelation function
with the perfect ferromagnetic state as the initial condition does not show aging.
(1987), Kitatani and Oguchi (1990, 1992), Ozeki (1990), Ueno and Ozeki (1991),
Singh (1991), and Le Doussal and Harris (1988, 1989). These problems have been
attracting resurgent interest recently as one can see in Singh and Adler (1996),
Ozeki and Ito (1998), Sørensen et al. (1998), Gingras and Sørensen (1998), Aarão
Reis et al. (1999), Kawashima and Aoki (2000), Mélin and Peysson (2000), Ho- 
necker et al. (2000), and Hukushima (2000). Properties of the multicritical point 
in two dimensions have been studied also from the standpoints of the quantum 
Hall effect (Cho and Fisher 1997; Senthil and Fisher 2000; Read and Ludwig 
2000) and supersymmetry (Gruzberg et al. 2001). 
Reliable transmission of information through noisy channels plays a vital role in
modern society. Some aspects of this problem have close formal similarities to
the theory of spin glasses. Noise in the transmission channel can be related to 
random interactions in spin glasses and the bit sequence representing informa- 
tion corresponds to the Ising spin configuration. The replica method serves as 
a powerful tool of analysis, and TAP-like equations can be used as a practical 
implementation of the algorithm to infer the original message. The gauge theory 
also provides an interesting point of view. 


5.1 Error-correcting codes 


Information theory was initiated by Shannon half a century ago. It formulates 
various basic notions on information transmission through noisy channels and 
develops a framework to manipulate those abstract objects. We first briefly re- 
view some ideas of information theory, and then restate the basic concepts, such 
as noise, communication, and information inference, in terms of statistical me- 
chanics of spin glasses. 


5.1.1 Transmission of information 

Suppose that we wish to transmit a message (information) represented as a 
sequence of N bits from one place to another. The path for information trans- 
mission is called a (transmission) channel. A channel usually carries noise and 
the output from a channel is different in some bits from the input. We then ask 
ourselves how we can infer the original message from the noisy output. 

It would be difficult to infer which bit of the output is corrupted by noise if the 
original message itself had been fed into the channel. It is necessary to make the 
message redundant before transmission by adding extra pieces of information, 
by use of which the noise can be removed. This process is called channel coding 
(or encoding), or simply coding. The encoded message is transmitted through a 
noisy channel. The process of information retrieval from the noisy output of a 
channel using redundancy is called decoding. 

Fig. 5.1. Parity-check code

A very simple example of encoding is to repeat the same bit sequence several
(for instance, three) times. If the three sequences received at the end of the
channel coincide, one may infer that there was no noise. If a specific bit has
different values in the three sequences, one may infer its correct value (0 or 1)
by the majority rule. For example, when the original message is (0, 0, 1, 1, 0) and
the three output sequences from a noisy channel are (0, 0, 1, 1, 0), (0, 1, 1, 1, 0),
and (0, 0, 1, 1, 0), the correct second bit can be inferred to be 0 whereas all the
other bits coincide to be (0, *, 1, 1, 0). This example shows that the redundancy
is helpful to infer the original information from the noisy message.
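The majority rule for the repetition code takes only a few lines. The sketch below uses the three received sequences of the example above; the function name is an illustrative choice.

```python
def majority_decode(received):
    """Decode a repetition code: for each bit position, return the value that
    appears in more than half of the received copies."""
    n_copies = len(received)
    return [1 if 2 * sum(bits) > n_copies else 0 for bits in zip(*received)]


if __name__ == "__main__":
    outputs = [
        [0, 0, 1, 1, 0],
        [0, 1, 1, 1, 0],   # second bit flipped by noise
        [0, 0, 1, 1, 0],
    ]
    print(majority_decode(outputs))
```

With an odd number of copies there is never a tie, which is one reason the example uses three repetitions.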

A more sophisticated method is the parity-check code. For example, suppose 
that seven bits are grouped together and one counts the number of ones in the 
group. If the number is even, one adds 0 as the eighth bit (the parity bit) and 
adds 1 otherwise. Then there are always an even number of ones in the group of 
eight bits in the encoded message (code word). If the noise rate of the channel 
is not very large and at most only one bit is flipped by noise out of the eight 
received bits, one may infer that the output of the channel carries no noise when 
one finds an even number of ones in the group of eight bits. If the number of ones
is odd, then there must be some noise, which implies error detection (Fig. 5.1). Error correction after
error detection needs some further trick to be elucidated in the following sections. 
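The single parity bit described above can be sketched as follows. The seven-bit block is an arbitrary example and the function names are illustrative; as in the text, the check only detects an odd number of flipped bits and cannot locate them.

```python
def add_parity_bit(block):
    """Append a parity bit so that the code word contains an even number of ones."""
    return list(block) + [sum(block) % 2]


def has_even_parity(word):
    """True when the received word passes the parity check."""
    return sum(word) % 2 == 0


if __name__ == "__main__":
    code_word = add_parity_bit([1, 0, 1, 1, 0, 0, 1])
    print(code_word, has_even_parity(code_word))

    corrupted = list(code_word)
    corrupted[2] ^= 1                      # a single bit flipped by noise
    print(corrupted, has_even_parity(corrupted))
```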


5.1.2 Similarity to spin glasses 

It is convenient to use the Ising spin ±1 instead of 0 and 1 to treat the present
problem by statistical mechanics. The basic operation on a bit sequence is the
sum modulo 2 (1 + 1 = 0 mod 2), which corresponds to the product of Ising
spins if one identifies 0 with $S_i = 1$ and 1 with $S_i = -1$. For example, 0 + 1 = 1
translates into 1 × (−1) = −1 and 1 + 1 = 0 into (−1) × (−1) = 1. We hereafter
use this identification of an Ising spin configuration with a bit sequence.
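The correspondence between addition modulo 2 and multiplication of Ising spins can be verified exhaustively. The helper names below are illustrative.

```python
def bit_to_spin(bit):
    """Map 0 -> +1 and 1 -> -1."""
    return 1 - 2 * bit


def spin_to_bit(spin):
    """Inverse map: +1 -> 0 and -1 -> 1."""
    return (1 - spin) // 2


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            mod2_sum = (a + b) % 2
            product = bit_to_spin(a) * bit_to_spin(b)
            print(a, b, mod2_sum, product, spin_to_bit(product) == mod2_sum)
```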

Generation of the parity bit in the parity-check code corresponds to the 
product of appropriate spins. By identifying such a product with the interaction 
between the relevant spins, we obtain a spin system very similar to the Mattis 
model of spin glasses. 

In the Mattis model one allocates a randomly quenched Ising spin $\xi_i$ to each
site, and the interaction between sites $i$ and $j$ is chosen to be $J_{ij} = \xi_i\xi_j$. The
Hamiltonian is then

$$
H = -\sum_{\langle ij\rangle} \xi_i \xi_j S_i S_j. \tag{5.1}
$$

The ground state is clearly identical to the configuration of the quenched spins
$S_i = \xi_i$ ($\forall i$) (or its total reversal $S_i = -\xi_i$ ($\forall i$)), see Fig. 5.2(a).
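That the ground state of (5.1) is $\xi$ or its total reversal can be confirmed by enumeration on a small lattice. The sketch below uses a four-site ring and an arbitrary $\xi$; both are assumptions chosen only to keep the enumeration trivial.

```python
import itertools


def mattis_energy(spins, xi, bonds):
    """H = -sum_<ij> xi_i xi_j S_i S_j, eq. (5.1)."""
    return -sum(xi[i] * xi[j] * spins[i] * spins[j] for i, j in bonds)


def ground_states(xi, bonds):
    """Return the minimum energy and all spin configurations that attain it."""
    best_energy, best_states = None, []
    for spins in itertools.product((1, -1), repeat=len(xi)):
        e = mattis_energy(spins, xi, bonds)
        if best_energy is None or e < best_energy:
            best_energy, best_states = e, [spins]
        elif e == best_energy:
            best_states.append(spins)
    return best_energy, best_states


if __name__ == "__main__":
    xi = (1, -1, -1, 1)                          # quenched pattern (arbitrary)
    ring = [(0, 1), (1, 2), (2, 3), (3, 0)]      # four-site ring
    print(ground_states(xi, ring))
```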

Returning to the problem of error-correcting codes, we form the Mattis-type
interactions $\{J^0_{i_1 \ldots i_r} = \xi_{i_1} \cdots \xi_{i_r}\}$ with $r$ an integer for appropriate combinations
of sites $\{i_1 \ldots i_r\}$. We then feed these interactions (encoded message), instead of
the original spin configuration $\xi = \{\xi_i\}$ (original message), into the noisy
channel. The encoded information is redundant because the number of interactions
$N_B$ (which is the number of elements in the set $\{(i_1 \ldots i_r)\}$) is larger than the
number of spins $N$.

Fig. 5.2. The ground-state spin configurations of the Mattis model without
noise added (a), the interaction marked by a cross flipped by noise (b), and
three interactions flipped by strong noise (c). Thin lines represent
ferromagnetic interactions and bold lines antiferromagnetic interactions. The
plaquettes marked by dots are frustrated.

For instance, the conventional $r = 2$ Mattis model on the two-dimensional
square lattice has $N_B = 2N$ interactions, the number of neighbouring pairs of sites. For
the original interactions without noise $J^0 = \{\xi_i\xi_j\}$, the product of the $J^0_{ij}$ along
an arbitrary closed loop $c$, $f_c = \prod_c J^0_{ij} = \prod_c (\xi_i\xi_j)$, is always unity (positive)
since all the $\xi_i$ appear an even number of times in the product. Thus the Mattis
model has no frustration. However, noise in the channel flips some elements of
$J^0$ and therefore the output of the channel, if it is regarded as interactions of
the spin system, includes frustration (Fig. 5.2(b)). Nevertheless the original spin
configuration is still the ground state of such a system if the noise rate is not large
(i.e. only a small number of bits are flipped) as exemplified in Fig. 5.2(b). It has
thus been shown that correct inference of the original message is possible even
if there exists a small amount of noise in the channel, as long as an appropriate
procedure is employed in encoding and decoding the message.
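Decoding by ground-state search can be illustrated with a small example. The sketch below encodes a five-bit message into the products $\xi_i\xi_j$ over all pairs of sites (an assumption made to keep the example small; the text above considers the square lattice), flips one interaction to mimic channel noise, and recovers the message, up to an overall sign, by exhaustive search. All names are illustrative.

```python
import itertools


def encode(xi):
    """Encoded message: J0_ij = xi_i * xi_j for every pair of sites (r = 2)."""
    n = len(xi)
    return {(i, j): xi[i] * xi[j] for i in range(n) for j in range(i + 1, n)}


def flip(couplings, bond):
    """Return a copy of the couplings with one interaction flipped by noise."""
    noisy = dict(couplings)
    noisy[bond] = -noisy[bond]
    return noisy


def ground_states(couplings, n):
    """Exhaustive search for the minimum of H = -sum J_ij S_i S_j."""
    best_energy, best_states = None, []
    for spins in itertools.product((1, -1), repeat=n):
        e = -sum(J * spins[i] * spins[j] for (i, j), J in couplings.items())
        if best_energy is None or e < best_energy:
            best_energy, best_states = e, [spins]
        elif e == best_energy:
            best_states.append(spins)
    return best_energy, best_states


if __name__ == "__main__":
    message = (1, -1, 1, 1, -1)
    received = flip(encode(message), (0, 1))     # one corrupted interaction
    print(ground_states(received, len(message)))
```

The global sign ambiguity is inherent to even $r$, since the products $\xi_i\xi_j$ do not change under $\xi \to -\xi$.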


5.1.3 Shannon bound 
It is necessary to introduce redundancy appropriately to transmit information 
accurately through a noisy channel. It can indeed be proved that such redun- 
dancy should exceed a threshold so that we are able to retrieve the original 
message without errors. 

Let us define the transmission rate R of information by a channel as 


$$
R = \frac{N \ (\text{number of bits in the original message})}{N_B \ (\text{number of bits in the encoded message})}. \tag{5.2}
$$

For smaller denominator, the redundancy is smaller and the transmission rate
is larger. Strictly speaking, the numerator of (5.2) should be the ‘number of
informative bits in the original message’. For biased messages, where 0 and 1 
appear with unequal probabilities, the amount of information measured in bits 
is smaller than the length of the binary message itself. This can easily be verified 
by the extreme case of a perfectly biased message with only ones; the message 
carries no information in such a case. By multiplying (5.2) by the number of bits 
transmitted per second, one obtains the number of information bits transmitted 
per second. 
We consider a memoryless binary symmetric channel (BSC) where noise flips 
a bit from 1 to 0 and 0 to 1 independently at each bit with a given probability. 
It is known that the transmission rate should satisfy the following inequality so 
that error-free transmission is possible through the BSC in the limit of a very 
long bit sequence: 
$$
R < C, \tag{5.3}
$$


where C is a function of the noise probability and is called the channel capacity. 
The capacity of the BSC is 


$$
C = 1 + p \log_2 p + (1 - p) \log_2 (1 - p). \tag{5.4}
$$


Here p is the probability that a bit is flipped by noise. Equation (5.3) is called 
the channel coding theorem of Shannon and implies that error-free transmission 
is possible if the transmission rate does not exceed the channel capacity and 
an appropriate procedure of encoding and decoding is employed. Similar results 
hold for other types of channels such as the Gaussian channel to be introduced 
later. A sketch of the arguments leading to the channel coding theorem is given 
in Appendix C. 
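
As a rough numerical illustration of (5.2)-(5.4), the following sketch evaluates the BSC capacity and tests whether a given rate satisfies the Shannon bound. The function names are illustrative choices made here, not notation from the text, and the logarithms are taken to base 2 so that the capacity is measured in bits.

```python
import math

def binary_entropy(p: float) -> float:
    """Binary entropy in bits; the limits p = 0 and p = 1 give zero."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def bsc_capacity(p: float) -> float:
    """Capacity (5.4) of the binary symmetric channel with flip probability p."""
    return 1.0 - binary_entropy(p)

def rate_is_admissible(n_message_bits: int, n_encoded_bits: int, p: float) -> bool:
    """Check the Shannon bound (5.3), R < C, for the rate R of (5.2)."""
    return n_message_bits / n_encoded_bits < bsc_capacity(p)
```

For example, with a flip probability of 0.1 a code that sends three channel bits per message bit lies below the capacity, whereas an uncoded transmission (rate one) does not.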

An explicit example of a code that saturates the Shannon bound (5.3) asymptotically
as an equality is the Sourlas code (Sourlas 1989): one takes all possible
products of $r$ spins from $N$ sites to form Mattis-type interactions. We have
mainly explained the conventional Mattis model with $r = 2$ in §5.1.2, and we
discuss the general case of an arbitrary $r$ ($= 2, 3, 4, \ldots$) hereafter. This is nothing
but the infinite-range model with $r$-body interactions.* It will be shown later
that the Shannon bound (5.3) is asymptotically achieved and the error rate approaches
zero in the Sourlas code if we take the limit $N \to \infty$ first and $r \to \infty$
afterwards. It should be noted, however, that the inequality (5.3) reduces to an
equality with both sides approaching zero in the Sourlas code, which means that
the transmission rate $R$ is zero asymptotically. Therefore the transmission is not
very efficient. A trick to improve this point, shown recently, is to take the
products of only a limited number of combinations of $r$ spins from $N$ sites,
not all possible combinations. All of these points will be elucidated in detail
later in the present chapter.


* The symbol $p$ is often used in the literature to denote the number of interacting spins. We
use $r$ instead to avoid confusion with the error probability in the BSC.

5.1.4 Finite-temperature decoding


Let us return to the argument of §5.1.2 and consider the problem of inference 
of spin configurations when the noise probability is not necessarily small. It was 
shown in §5.1.2 that the ground state of the Ising spin system (with the output 
of the channel as interactions) is the true original message (spin configuration) if 
the channel is not very noisy. For larger noise, the ground state is different from 
the original spin configuration (Fig. 5.2(c)). This suggests that the original spin 
configuration is one of the excited states and thus one may be able to decode 
more accurately by searching for states at finite temperature. It will indeed be
shown that states at a specific finite temperature $T_p$ determined by the error
rate $p$ give better results under a certain criterion. This temperature $T_p$ turns
out to coincide with the temperature at the Nishimori line discussed in Chapter
4 (Ruján 1993).


5.2 Spin glass representation 


We now formulate the arguments in the previous section in a more quantitative 
form and proceed to explicit calculations (Sourlas 1989). 


5.2.1 Conditional probability 


Suppose that the Ising spin configuration $\xi = \{\xi_i\}$ has been generated according
to a probability distribution function $P(\xi)$. This distribution $P(\xi)$ for generating
the original message is termed the prior. Our goal is to infer the original
spin configuration from the output of a noisy channel as accurately as possible.
Following the suggestion in the previous section, we form a set of products of $r$
spins

$$ J^0_{i_1 \ldots i_r} = \xi_{i_1} \xi_{i_2} \cdots \xi_{i_r} \quad (= \pm 1) \tag{5.5} $$

for appropriately chosen combinations of the $\xi_i$ and feed the set of interactions
into the channel. We first consider the BSC. The output of the channel $J_{i_1 \ldots i_r}$
is flipped from the corresponding input $J^0_{i_1 \ldots i_r} = \xi_{i_1} \cdots \xi_{i_r}$ with probability $p$,
in which case it is equal to $-\xi_{i_1} \cdots \xi_{i_r}$. The other possibility, the correct output
$\xi_{i_1} \cdots \xi_{i_r}$, has the probability $1 - p$. The output probability of a BSC can be
expressed in terms of a conditional probability:


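$$ P(J_{i_1 \ldots i_r} \mid \xi_{i_1} \cdots \xi_{i_r}) = \frac{\exp(\beta_p J_{i_1 \ldots i_r} \xi_{i_1} \cdots \xi_{i_r})}{2 \cosh \beta_p}, \tag{5.6} $$

where $\beta_p$ is a function of $p$ defined as

$$ e^{2\beta_p} = \frac{1 - p}{p}. \tag{5.7} $$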

Equation (5.6) is equal to $1 - p$ when $J_{i_1 \ldots i_r} = \xi_{i_1} \cdots \xi_{i_r}$ according to (5.7)
and is equal to $p$ when $J_{i_1 \ldots i_r} = -\xi_{i_1} \cdots \xi_{i_r}$, implying that (5.6) is the correct
conditional probability to characterize the channel. Equations (5.6) and (5.7)
have the same form as the bond distribution of the $\pm J$ model in Chapter 4, so the
inverse of $\beta_p$ defined in (5.7), denoted $T_p$ at the end of the previous section,
coincides with the temperature on the Nishimori line. One should note here that
$p$ in the present chapter is $1 - p$ in Chapter 4. The temperature $T_p$ is sometimes
called the Nishimori temperature in the literature of error-correcting codes.

Assume that (5.6) applies to each set $(i_1 \ldots i_r)$ independently. This is a memoryless
channel in which each bit is affected by noise independently. The overall
probability is then the product of (5.6),

$$ P(J \mid \xi) = \frac{1}{(2 \cosh \beta_p)^{N_B}} \exp\left( \beta_p \sum J_{i_1 \ldots i_r} \xi_{i_1} \cdots \xi_{i_r} \right), \tag{5.8} $$

where the sum in the exponent is taken over all sets $(i_1 \ldots i_r)$ for which the spin
products are generated by (5.5). The symbol $N_B$ is for the number of terms in
this sum and is equal to the number of bits fed into the channel.
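
The statement that (5.6) reproduces the probabilities $1 - p$ and $p$ can be checked directly. The sketch below is only an illustration under the definitions (5.6)-(5.8); the function names are hypothetical and the spin products are passed in as plain lists of $\pm 1$.

```python
import math

def beta_p(p: float) -> float:
    """Inverse Nishimori temperature from (5.7): exp(2 * beta_p) = (1 - p) / p."""
    return 0.5 * math.log((1.0 - p) / p)

def bsc_bond_probability(J: int, xi_product: int, p: float) -> float:
    """Conditional probability (5.6) of the output J = +-1 given the input product +-1."""
    b = beta_p(p)
    return math.exp(b * J * xi_product) / (2.0 * math.cosh(b))

def bsc_likelihood(J_list, xi_products, p: float) -> float:
    """Memoryless channel (5.8): product of (5.6) over the N_B transmitted bonds."""
    b = beta_p(p)
    exponent = b * sum(J * x for J, x in zip(J_list, xi_products))
    return math.exp(exponent) / (2.0 * math.cosh(b)) ** len(J_list)
```

With $p = 0.2$, an unflipped bond has probability 0.8 and a flipped one 0.2, and a sequence of three bonds with one flip has likelihood $0.8 \times 0.2 \times 0.8$.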


5.2.2 Bayes formula 


The task is to infer the original message (spin configuration) $\xi$ from the output
$J = \{J_{i_1 \ldots i_r}\}$. For this purpose, it is necessary to introduce the conditional
probability of $\xi$ given $J$, which is called the posterior. The posterior is the
conditional probability with the two entries of the left hand side of (5.8), $J$ and $\xi$,
exchanged. The Bayes formula is useful for exchanging these two entries.
The joint probability $P(A, B)$ that two events $A$ and $B$ occur is expressed
in terms of the product of $P(B)$ and the conditional probability $P(A|B)$ for $A$
under the condition that $B$ occurred. The same holds if $A$ and $B$ are exchanged.
Thus we have

$$ P(A, B) = P(A|B) P(B) = P(B|A) P(A). \tag{5.9} $$

It follows immediately that

$$ P(A|B) = \frac{P(B|A) P(A)}{P(B)} = \frac{P(B|A) P(A)}{\sum_A P(B|A) P(A)}. \tag{5.10} $$

Equation (5.10) is the Bayes formula.
We can express $P(\sigma|J)$ in terms of (5.8) using the Bayes formula:

$$ P(\sigma \mid J) = \frac{P(J \mid \sigma) P(\sigma)}{\mathrm{Tr}_\sigma P(J \mid \sigma) P(\sigma)}. \tag{5.11} $$

We have written $\sigma = \{\sigma_1, \ldots, \sigma_N\}$ for dynamical variables used for decoding.
The final decoded result will be denoted by $\hat{\xi} = \{\hat{\xi}_1, \ldots, \hat{\xi}_N\}$, and we reserve
$\xi = \{\xi_1, \ldots, \xi_N\}$ for the true original configuration. Equation (5.11) is the starting
point of our argument.

The probability $P(J|\sigma)$ represents the characteristics of a memoryless BSC
and is given in (5.8). It is therefore possible to infer the original message by (5.11)
if the prior $P(\sigma)$ is known. Explicit theoretical analysis is facilitated by assuming
a message source that generates various messages with equal probability. This
assumption is not necessarily unnatural in realistic situations where information
compression before encoding usually generates a rather uniform distribution of
zeros and ones. In such a case, $P(\sigma)$ can be considered a constant. The posterior
is then

$$ P(\sigma \mid J) = \frac{\exp\left( \beta_p \sum J_{i_1 \ldots i_r} \sigma_{i_1} \cdots \sigma_{i_r} \right)}{\mathrm{Tr}_\sigma \exp\left( \beta_p \sum J_{i_1 \ldots i_r} \sigma_{i_1} \cdots \sigma_{i_r} \right)}. \tag{5.12} $$

Since $J$ is given and fixed in the present problem, (5.12) is nothing more than the
Boltzmann factor of an Ising spin glass with randomly quenched interactions $J$.
We have thus established a formal equivalence between the probabilistic inference
problem of messages for a memoryless BSC and the statistical mechanics of the
Ising spin glass.


5.2.3 MAP and MPM 

Equation (5.12) is the probability distribution of the inferred spin configuration
$\sigma$, given the output of the channel $J$. Then the spin configuration to maximize
(5.12) is a good candidate for the decoded (inferred) spin configuration. Maximization
of the Boltzmann factor is equivalent to the ground-state search of the
corresponding Hamiltonian

$$ H = -\sum J_{i_1 \ldots i_r} \sigma_{i_1} \cdots \sigma_{i_r}. \tag{5.13} $$

This method of decoding is called the maximum a posteriori probability (MAP).
This is the idea already explained in §5.1.2. Maximization of the conditional
probability $P(J|\sigma)$ with respect to $\sigma$ is equivalent to maximization of the posterior
$P(\sigma|J)$ if the prior $P(\sigma)$ is uniform. The former idea is termed the maximum
likelihood method as $P(J|\sigma)$ is the likelihood function of $\sigma$.

The MAP maximizes the posterior of the whole bit sequence $\sigma$. There is
another strategy of decoding in which we focus our attention on a single bit $i$,
not the whole sequence. This means we trace out all the spin variables except for a
single $\sigma_i$ to obtain the posterior only of $\sigma_i$. This process is called marginalization
in statistics:

$$ P(\sigma_i \mid J) = \frac{\mathrm{Tr}_{\sigma (\neq \sigma_i)} \exp\left( \beta_p \sum J_{i_1 \ldots i_r} \sigma_{i_1} \cdots \sigma_{i_r} \right)}{\mathrm{Tr}_\sigma \exp\left( \beta_p \sum J_{i_1 \ldots i_r} \sigma_{i_1} \cdots \sigma_{i_r} \right)}. \tag{5.14} $$

We then compare $P(\sigma_i = 1|J)$ and $P(\sigma_i = -1|J)$ and, if the former is larger,
we assign one to the decoded result of the $i$th bit ($\hat{\xi}_i = 1$) and assign $\hat{\xi}_i = -1$
otherwise. This process is carried out for all bits, and the set of thus decoded bits
constitutes the final result. This method is called the finite-temperature decoding
or the maximizer of posterior marginals (MPM) and in general gives a result
different from that of the MAP (Ruján 1993; Nishimori 1993; Sourlas 1994; Iba 1999).

It is instructive to consider the MPM from a different point of view. The
MPM is equivalent to accepting the sign of the difference of two probabilities as
the $i$th decoded bit

$$ \hat{\xi}_i = \mathrm{sgn}\{ P(\sigma_i = 1 \mid J) - P(\sigma_i = -1 \mid J) \}. \tag{5.15} $$

This may also be written as

$$ \hat{\xi}_i = \mathrm{sgn}\left( \sum_{\sigma_i = \pm 1} \sigma_i P(\sigma_i \mid J) \right) = \mathrm{sgn}\left( \frac{\mathrm{Tr}_\sigma\, \sigma_i P(\sigma \mid J)}{\mathrm{Tr}_\sigma P(\sigma \mid J)} \right) = \mathrm{sgn}\langle \sigma_i \rangle_{\beta_p}. \tag{5.16} $$

Here $\langle \sigma_i \rangle_{\beta_p}$ is the local magnetization with (5.12) as the Boltzmann factor. Equation
(5.16) means to calculate the local magnetization at a finite temperature
$T_p = \beta_p^{-1}$ and assign its sign to the decoded bit. The MAP can be regarded
as the low-temperature (large-$\beta$) limit in place of finite $\beta_p$ in (5.16). The MAP
was derived as the maximizer of the posterior of the whole bit sequence, which
has now been shown to be equivalent to the low-temperature limit of the MPM,
finite-temperature decoding. The MPM, by contrast, maximizes the posterior
of a single bit. We shall study the relation between these two methods in more
detail subsequently.
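
The two decoding strategies can be compared on a very small system by exhaustive enumeration. The following sketch is only an illustration: it assumes a uniform prior, represents the channel output as a dictionary from index tuples to couplings (a data layout chosen here, not taken from the text), and is feasible only for a handful of spins because it visits all $2^N$ configurations.

```python
import itertools
import math

def encode(xi, r):
    """Noise-free code word: all products of r spins, as in (5.5)."""
    return {idx: math.prod(xi[i] for i in idx)
            for idx in itertools.combinations(range(len(xi)), r)}

def decode(J, N, beta):
    """Return (MAP, MPM) estimates for couplings J = {index tuple: value}."""
    best_energy, map_estimate = None, None
    magnetization = [0.0] * N  # unnormalized Tr sigma_i exp(-beta * H)
    for sigma in itertools.product((1, -1), repeat=N):
        # Energy (5.13): H = -sum J_{i1..ir} sigma_{i1} ... sigma_{ir}
        energy = -sum(value * math.prod(sigma[i] for i in idx)
                      for idx, value in J.items())
        if best_energy is None or energy < best_energy:
            best_energy, map_estimate = energy, sigma  # ground-state search (MAP)
        weight = math.exp(-beta * energy)
        for i in range(N):
            magnetization[i] += sigma[i] * weight
    # MPM, eq. (5.16): sign of the local magnetization at inverse temperature beta
    mpm_estimate = tuple(1 if m >= 0 else -1 for m in magnetization)
    return map_estimate, mpm_estimate
```

A typical use is to encode a short message with `encode`, flip some of the couplings by hand to mimic the channel, and call `decode` with `beta` set to $\beta_p$ of (5.7) for the MPM.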


5.2.4 Gaussian channel 


It is sometimes convenient to consider channels other than the BSC. A typical
example is a Gaussian channel. The encoded message $\xi_{i_1} \cdots \xi_{i_r}$ ($= \pm 1$) is fed into
the channel as a signal of some amplitude, $J_0 \xi_{i_1} \cdots \xi_{i_r}$. The output is continuously
distributed around this input with the Gaussian distribution of variance
$J^2$:

$$ P(J_{i_1 \ldots i_r} \mid \xi_{i_1} \cdots \xi_{i_r}) = \frac{1}{\sqrt{2\pi} J} \exp\left\{ -\frac{(J_{i_1 \ldots i_r} - J_0 \xi_{i_1} \cdots \xi_{i_r})^2}{2J^2} \right\}. \tag{5.17} $$

If the prior is uniform, the posterior is written using the Bayes formula as

$$ P(\sigma \mid J) = \frac{\exp\left\{ (J_0/J^2) \sum J_{i_1 \ldots i_r} \sigma_{i_1} \cdots \sigma_{i_r} \right\}}{\mathrm{Tr}_\sigma \exp\left\{ (J_0/J^2) \sum J_{i_1 \ldots i_r} \sigma_{i_1} \cdots \sigma_{i_r} \right\}}. \tag{5.18} $$

Comparison of this equation with (5.12) shows that the posterior of the Gaussian
channel corresponds to that of the BSC with $\beta_p$ replaced by $J_0/J^2$. We can
therefore develop the following arguments for both of these channels almost in
the same way.
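
The step from the Gaussian likelihood to a Boltzmann factor with effective inverse temperature $J_0/J^2$ can be made explicit by expanding the square in the exponent. The following short derivation is a sketch that uses only $(\sigma_{i_1} \cdots \sigma_{i_r})^2 = 1$ and the uniform prior:

```latex
-\frac{(J_{i_1 \ldots i_r} - J_0 \sigma_{i_1} \cdots \sigma_{i_r})^2}{2J^2}
  = -\frac{J_{i_1 \ldots i_r}^2}{2J^2}
    + \frac{J_0}{J^2}\, J_{i_1 \ldots i_r} \sigma_{i_1} \cdots \sigma_{i_r}
    - \frac{J_0^2}{2J^2}.
```

The first and last terms do not depend on $\sigma$ and therefore cancel between the numerator and the denominator of the Bayes formula, leaving only the middle term in the exponent.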


5.3 Overlap 
5.3.1 Measure of decoding performance 


It is convenient to introduce a measure of success of decoding that represents
the proximity of the decoded message to the original one. The decoded $i$th bit is
$\hat{\xi}_i = \mathrm{sgn}\langle \sigma_i \rangle_\beta$ with $\beta = \beta_p$ for the MPM and $\beta \to \infty$ for the MAP. It sometimes
happens that one is not aware of the noise rate $p$ of the channel, or equivalently
$\beta_p$. Consequently it makes sense even for the MPM to develop arguments with
$\beta$ unspecified, which we do in the following.
The product of $\hat{\xi}_i$ and the corresponding original bit $\xi_i$, $\xi_i\, \mathrm{sgn}\langle \sigma_i \rangle_\beta$, is 1 if
these two coincide and $-1$ otherwise. An appropriate strategy is to increase the
probability that this product is equal to one. We average this product over the
output probability of the channel $P(J|\xi)$ and the prior $P(\xi)$,

$$ M(\beta) = \mathrm{Tr}_\xi \sum_J P(\xi) P(J \mid \xi)\, \xi_i\, \mathrm{sgn}\langle \sigma_i \rangle_\beta, \tag{5.19} $$

which is the overlap of the original and decoded messages. We have denoted the
sum over the site variables $\xi$ by $\mathrm{Tr}_\xi$ and the one over the bond variables $J$ by
$\sum_J$. A better decoding would be the one that gives a larger $M(\beta)$. For a uniform
message source $P(\xi) = 2^{-N}$, the average over $\xi$ and $J$ leads to the right hand
side independent of $i$,

$$ M(\beta) = \frac{1}{2^N (2 \cosh \beta_p)^{N_B}} \sum_J \mathrm{Tr}_\xi \exp\left( \beta_p \sum J_{i_1 \ldots i_r} \xi_{i_1} \cdots \xi_{i_r} \right) \xi_i\, \mathrm{sgn}\langle \sigma_i \rangle_\beta. \tag{5.20} $$

This expression may be regarded as the average of $\mathrm{sgn}\langle \sigma_i \rangle_\beta$ with the weight
proportional to $Z(\beta_p) \langle \xi_i \rangle_{\beta_p}$, which is essentially equivalent to the configurational
average of the correlation function with a similar form of the weight that appears
frequently in the gauge theory of spin glasses, for example (4.43).

The overlap is closely related to the Hamming distance of the two messages
(the number of different bits at the corresponding positions). For closer
messages, the overlap is larger and the Hamming distance is smaller. When the 
two messages coincide, the overlap is one and the Hamming distance is zero, 
while, for two messages completely inverted from each other, the overlap is —1 
and the Hamming distance is N. 
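
For two given $\pm 1$ sequences the relation between the two quantities is linear. The sketch below assumes that the overlap of two specific messages is taken as the site average of the bit products (so that it equals $1 - 2d/N$ for Hamming distance $d$); the function names are illustrative.

```python
def hamming_distance(a, b) -> int:
    """Number of positions at which the two +-1 sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def overlap(a, b) -> float:
    """Site-averaged overlap (1/N) * sum_i a_i * b_i of two +-1 sequences."""
    return sum(x * y for x, y in zip(a, b)) / len(a)
```

Identical messages give overlap 1 and distance 0, and completely inverted messages give overlap $-1$ and distance $N$, in agreement with the limits quoted above.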


5.3.2 Upper bound on the overlap 
An interesting feature of the overlap $M(\beta)$ is that it is a non-monotonic function
of $\beta$ with its maximum at $\beta = \beta_p$:

$$ M(\beta) \le M(\beta_p). \tag{5.21} $$

In other words the MPM at the correct parameter value $\beta = \beta_p$ gives the optimal
result in the sense of maximization of the overlap (Ruján 1993; Nishimori
1993; Sourlas 1994; Iba 1999). The MPM is sometimes called the Bayes-optimal
strategy.

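To prove (5.21), we first take the absolute value of both sides of (5.20) and
exchange the absolute value operation and the sum over $J$ to obtain

$$ M(\beta) \le \frac{1}{2^N (2 \cosh \beta_p)^{N_B}} \sum_J \left| \mathrm{Tr}_\xi\, \xi_i \exp\left( \beta_p \sum J_{i_1 \ldots i_r} \xi_{i_1} \cdots \xi_{i_r} \right) \right|, \tag{5.22} $$

where we have used $|\mathrm{sgn}\langle \sigma_i \rangle_\beta| = 1$. By rewriting the right hand side as follows,
we can derive (5.21):

$$
\begin{aligned}
M(\beta) &\le \frac{1}{2^N (2 \cosh \beta_p)^{N_B}} \sum_J \frac{\left( \mathrm{Tr}_\xi\, \xi_i \exp( \beta_p \sum J_{i_1 \ldots i_r} \xi_{i_1} \cdots \xi_{i_r} ) \right)^2}{\left| \mathrm{Tr}_\xi\, \xi_i \exp( \beta_p \sum J_{i_1 \ldots i_r} \xi_{i_1} \cdots \xi_{i_r} ) \right|} \\
&= \frac{1}{2^N (2 \cosh \beta_p)^{N_B}} \sum_J \mathrm{Tr}_\xi\, \xi_i \exp\left( \beta_p \sum J_{i_1 \ldots i_r} \xi_{i_1} \cdots \xi_{i_r} \right) \cdot \frac{\mathrm{Tr}_\xi\, \xi_i \exp( \beta_p \sum J_{i_1 \ldots i_r} \xi_{i_1} \cdots \xi_{i_r} )}{\left| \mathrm{Tr}_\xi\, \xi_i \exp( \beta_p \sum J_{i_1 \ldots i_r} \xi_{i_1} \cdots \xi_{i_r} ) \right|} \\
&= \frac{1}{2^N (2 \cosh \beta_p)^{N_B}} \sum_J \mathrm{Tr}_\xi\, \xi_i \exp\left( \beta_p \sum J_{i_1 \ldots i_r} \xi_{i_1} \cdots \xi_{i_r} \right) \mathrm{sgn}\langle \sigma_i \rangle_{\beta_p} \\
&= M(\beta_p).
\end{aligned}
\tag{5.23}
$$

Almost the same manipulations lead to the following inequality for the Gaussian
channel:

$$ M(\beta) = \frac{1}{2^N} \mathrm{Tr}_\xi \int \prod dJ_{i_1 \ldots i_r}\, P(J \mid \xi)\, \xi_i\, \mathrm{sgn}\langle \sigma_i \rangle_\beta \le M\!\left( \frac{J_0}{J^2} \right). \tag{5.24} $$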


We have shown that the bit-wise overlap defined in (5.19) is maximized by
the MPM with the correct parameter ($\beta = \beta_p$ for the BSC). This is natural in
that the MPM at $\beta = \beta_p$ was introduced to maximize the bit-wise (marginalized)
posterior. The MAP maximizes the posterior of the whole bit sequence $\sigma$, but
its probability of error for a given single bit is larger than the MPM with the
correct parameter value. This observation is also confirmed from the viewpoint
of Bayesian statistics (Sourlas 1994; Iba 1999).
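
The inequality (5.21) holds for any finite system, so it can be checked by brute force on a tiny code. The sketch below evaluates (5.20) exactly by summing over all channel outputs and all spin configurations. The sizes ($N = 5$, $r = 3$), the noise rate, and all names are arbitrary illustrative choices; an odd $r$ is used on the assumption that an even $r$ would leave a global spin-flip symmetry and hence a vanishing local magnetization.

```python
import itertools
import numpy as np

def nishimori_beta(p):
    """beta_p of (5.7)."""
    return 0.5 * np.log((1 - p) / p)

def overlap_by_enumeration(beta, N=5, r=3, p=0.2, site=0):
    """Exact M(beta) of (5.20) for all r-spin products of N spins on a BSC."""
    spins = np.array(list(itertools.product((1, -1), repeat=N)))        # (2^N, N)
    groups = list(itertools.combinations(range(N), r))
    n_bonds = len(groups)                                               # N_B
    # products[c, b]: product of the r spins of bond b in configuration c
    products = np.stack([spins[:, list(g)].prod(axis=1) for g in groups], axis=1)
    outputs = np.array(list(itertools.product((1, -1), repeat=n_bonds)))  # all J
    coupling = outputs @ products.T      # sum_b J_b * product_b, shape (2^N_B, 2^N)
    bp = nishimori_beta(p)

    def traced(b):
        # Tr sigma_site exp(b * sum J sigma...sigma), one value per output J
        return np.exp(b * coupling) @ spins[:, site]

    weight = traced(bp)                  # Tr_xi xi_i exp(beta_p ...) in (5.20)
    norm = 2**N * (2 * np.cosh(bp))**n_bonds
    return float(np.sum(weight * np.sign(traced(beta))) / norm)
```

Calling `overlap_by_enumeration(b)` for several values of `b` and comparing with the value at `nishimori_beta(p)` gives a direct, if small-scale, picture of the maximum at the Nishimori temperature.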

The inequalities (5.21) and (5.24) are essentially identical to (4.63) derived
by the gauge theory in §4.6.4. To understand this, we note that generality is not
lost by the assumption $\xi_i = 1$ ($\forall i$) in the calculation of $M(\beta)$ for a uniform
information source. This may be called the ferromagnetic gauge. Indeed, the gauge
transformation $J_{i_1 \ldots i_r} \to J_{i_1 \ldots i_r} \xi_{i_1} \cdots \xi_{i_r}$ and $\sigma_i \to \sigma_i \xi_i$ in (5.19) removes $\xi_i$ from
the equation. Then $M$ defined in (5.19) is seen to be identical to (4.63) with the
two-point correlation replaced with a single-point spin expectation value. The
argument in §4.6.4 applies not only to two-point correlations but to any
correlations, and thus the result of §4.6.4 agrees with that of the present section.
Therefore the overlap $M$ of the decoded bit and the original bit becomes a maximum
on the Nishimori line as a function of the decoding temperature with fixed
error rate $p$. For the Gaussian channel, (5.24) corresponds to the fact that the
Nishimori line is represented as $J_0/J^2 = \beta$ as observed in §4.3.3.


5.4 Infinite-range model 


Inequality (5.21) shows that $M(\beta)$ is a non-monotonic function, but the inequality
does not give the explicit $\beta$-dependence. It may also happen that it is not
easy to adjust $\beta$ exactly to $\beta_p$ in practical situations where one may not know
the noise rate $p$ very precisely. One is then forced to estimate $p$ but should
always be careful of errors in the parameter estimation. It is therefore useful if
one can estimate the effects of errors in the parameter estimation on the overlap
$M(\beta)$.
A solvable model, for which we can calculate the explicit form of $M(\beta)$, would
serve as an important prototype for this and other purposes.


5.4.1 Infinite-range model 


The Sourlas code explained in §5.1.3 is represented as the infinite-range model.
In the Sourlas code, the sum in the Hamiltonian

$$ H = -\sum_{i_1 < \cdots < i_r} J_{i_1 \ldots i_r} \sigma_{i_1} \cdots \sigma_{i_r} \tag{5.25} $$

runs over all possible combinations of $r$ spins out of $N$ spins. Then the number
of terms is $N_B = \binom{N}{r}$. This infinite-range model with $r$-body interactions can be
solved explicitly by the replica method (Derrida 1981; Gross and Mézard 1984;
Gardner 1985; Nishimori and Wong 1999). We show the solution for the Gaussian
channel. The BSC is expected to give the same result in the thermodynamic limit
according to the central limit theorem.

The parameters $J_0$ and $J$ in the Gaussian distribution (5.17) must be scaled
appropriately with $N$ so that the expectation value of the infinite-range Hamiltonian
(5.25) is extensive (proportional to $N$) in the limit $N \to \infty$. If we also
demand that physical quantities remain finite in the limit $r \to \infty$ after $N \to \infty$,
then $r$ should also be appropriately scaled. The Gaussian distribution satisfying
these requirements is

$$ P(J_{i_1 \ldots i_r} \mid \xi_{i_1} \cdots \xi_{i_r}) = \left( \frac{N^{r-1}}{J^2 \pi r!} \right)^{1/2} \exp\left\{ -\frac{N^{r-1}}{J^2 r!} \left( J_{i_1 \ldots i_r} - \frac{j_0 r!}{N^{r-1}} \xi_{i_1} \cdots \xi_{i_r} \right)^2 \right\}, \tag{5.26} $$

where $J$ and $j_0$ are independent of $N$ and $r$. The appropriateness of (5.26) is
justified by the expressions of various quantities to be derived below that have
non-trivial limits as $N \to \infty$ and $r \to \infty$.


5.4.2 Replica calculations 


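Following the general prescription of the replica method, we first calculate the
configurational average of the $n$th power of the partition function and take the
limit $n \to 0$. Order parameters naturally emerge in due course. The overlap $M$
is expressed as a function of these order parameters.

The configurational average of the $n$th power of the partition function of the
infinite-range model is written for the uniform prior ($P = 2^{-N}$) as

$$
\begin{aligned}
[Z^n] &= \mathrm{Tr}_\xi \int \prod_{i_1 < \cdots < i_r} dJ_{i_1 \ldots i_r}\, P(\xi) P(J \mid \xi)\, Z^n \\
&= \frac{1}{2^N} \mathrm{Tr}_\xi \int \prod_{i_1 < \cdots < i_r} dJ_{i_1 \ldots i_r}\, P(J \mid \xi)\, \mathrm{Tr}_\sigma \exp\left( \beta \sum_{i_1 < \cdots < i_r} J_{i_1 \ldots i_r} \sum_\alpha \sigma^\alpha_{i_1} \cdots \sigma^\alpha_{i_r} \right),
\end{aligned}
\tag{5.27}
$$

where $P(J|\xi)$ is the product of (5.26) over all bonds and $\alpha$ ($= 1, \ldots, n$) is the
replica index. A gauge transformation

$$ J_{i_1 \ldots i_r} \to J_{i_1 \ldots i_r} \xi_{i_1} \cdots \xi_{i_r}, \quad \sigma^\alpha_i \to \sigma^\alpha_i \xi_i \tag{5.28} $$

in (5.27) removes $\xi$ from the integrand. The problem is thus equivalent to the
case of $\xi_i = 1$ ($\forall i$), the ferromagnetic gauge. We mainly use this gauge in the
present chapter. The sum over $\xi$ in (5.27) then simply gives $2^N$, and the factor
$2^{-N}$ in front of the whole expression disappears.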

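It is straightforward to carry out the Gaussian integral in (5.27). If we ignore
the trivial overall constant and terms of lower order in $N$, the result is

$$
\begin{aligned}
[Z^n] &= \mathrm{Tr}_\sigma \exp\left\{ \frac{\beta^2 J^2 r!}{4 N^{r-1}} \sum_{i_1 < \cdots < i_r} \left( \sum_\alpha \sigma^\alpha_{i_1} \cdots \sigma^\alpha_{i_r} \right)^2 + \frac{\beta j_0 r!}{N^{r-1}} \sum_{i_1 < \cdots < i_r} \sum_\alpha \sigma^\alpha_{i_1} \cdots \sigma^\alpha_{i_r} \right\} \\
&= \mathrm{Tr}_\sigma \exp\left\{ \frac{\beta^2 J^2 N}{2} \sum_{\alpha < \beta} \left( \frac{1}{N} \sum_i \sigma^\alpha_i \sigma^\beta_i \right)^r + \frac{\beta^2 J^2}{4} N n + j_0 \beta N \sum_\alpha \left( \frac{1}{N} \sum_i \sigma^\alpha_i \right)^r \right\}.
\end{aligned}
\tag{5.29}
$$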


Here $\mathrm{Tr}_\sigma$ is the sum over $\sigma$. In deriving the final expression, we have used a
relation that generalizes the following relation to the $r$-body case:

$$ \sum_{i_1 < i_2} \sigma_{i_1} \sigma_{i_2} = \frac{1}{2} \left( \sum_i \sigma_i \right)^2 - \frac{N}{2}. \tag{5.30} $$

It is convenient to introduce the variables


$$ q_{\alpha\beta} = \frac{1}{N} \sum_i \sigma^\alpha_i \sigma^\beta_i, \quad m_\alpha = \frac{1}{N} \sum_i \sigma^\alpha_i \tag{5.31} $$

to replace the expressions inside the parentheses in the final expression of (5.29)
by $q_{\alpha\beta}$ and $m_\alpha$ so that we can carry out the $\mathrm{Tr}_\sigma$ operation. The condition to
satisfy (5.31) is imposed by the Fourier-transformed expressions of delta functions
with integration variables $\hat{q}_{\alpha\beta}$ and $\hat{m}_\alpha$:


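$$
\begin{aligned}
[Z^n] = \mathrm{Tr}_\sigma \int \prod_{\alpha < \beta} dq_{\alpha\beta}\, d\hat{q}_{\alpha\beta} \int \prod_\alpha dm_\alpha\, d\hat{m}_\alpha \exp\Bigg\{ & \frac{\beta^2 J^2 N}{2} \sum_{\alpha < \beta} (q_{\alpha\beta})^r - N \sum_{\alpha < \beta} q_{\alpha\beta} \hat{q}_{\alpha\beta} + N \sum_{\alpha < \beta} \hat{q}_{\alpha\beta} \left( \frac{1}{N} \sum_i \sigma^\alpha_i \sigma^\beta_i \right) \\
& + j_0 \beta N \sum_\alpha (m_\alpha)^r - N \sum_\alpha m_\alpha \hat{m}_\alpha + N \sum_\alpha \hat{m}_\alpha \left( \frac{1}{N} \sum_i \sigma^\alpha_i \right) + \frac{1}{4} \beta^2 J^2 N n \Bigg\}.
\end{aligned}
\tag{5.32}
$$

We can now operate $\mathrm{Tr}_\sigma$ independently at each $i$ to find

$$
\begin{aligned}
[Z^n] = \int \prod_{\alpha < \beta} dq_{\alpha\beta}\, d\hat{q}_{\alpha\beta} \int \prod_\alpha dm_\alpha\, d\hat{m}_\alpha \exp\Bigg\{ & \frac{\beta^2 J^2 N}{2} \sum_{\alpha < \beta} (q_{\alpha\beta})^r - N \sum_{\alpha < \beta} q_{\alpha\beta} \hat{q}_{\alpha\beta} + \frac{1}{4} \beta^2 J^2 N n + j_0 \beta N \sum_\alpha (m_\alpha)^r \\
& - N \sum_\alpha m_\alpha \hat{m}_\alpha + N \log \mathrm{Tr} \exp\left( \sum_{\alpha < \beta} \hat{q}_{\alpha\beta} \sigma^\alpha \sigma^\beta + \sum_\alpha \hat{m}_\alpha \sigma^\alpha \right) \Bigg\}.
\end{aligned}
\tag{5.33}
$$

Here Tr denotes sums over single-site replica spins $\{\sigma^1, \ldots, \sigma^n\}$.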

5.4.3 Replica-symmetric solution 

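Further calculations are possible under the assumption of replica symmetry:

$$ q = q_{\alpha\beta}, \quad \hat{q} = \hat{q}_{\alpha\beta}, \quad m = m_\alpha, \quad \hat{m} = \hat{m}_\alpha. \tag{5.34} $$

We fix $n$ and take the thermodynamic limit $N \to \infty$ to evaluate the integral by
steepest descent. The result is

$$
\begin{aligned}
[Z^n] = \exp\Bigg[ N \Bigg\{ & \beta^2 J^2 \frac{n(n-1)}{4} q^r - \frac{n(n-1)}{2} q \hat{q} + j_0 \beta n m^r - n m \hat{m} + \frac{1}{4} n \beta^2 J^2 \\
& + \log \mathrm{Tr} \int Du\, \exp\left( \sqrt{\hat{q}}\, u \sum_\alpha \sigma^\alpha + \hat{m} \sum_\alpha \sigma^\alpha - \frac{n}{2} \hat{q} \right) \Bigg\} \Bigg],
\end{aligned}
\tag{5.35}
$$

where $Du = e^{-u^2/2}\, du / \sqrt{2\pi}$. This $u$ has been introduced to reduce the double sum
over $\alpha$ and $\beta$ ($\sum_{\alpha < \beta}$) to a single sum. The trace operation can be performed for
each replica independently so that the free energy $\beta f$ defined by $[Z^n] = e^{-N n \beta f}$
becomes in the limit $n \to 0$

$$ -\beta f = -\frac{1}{4} \beta^2 J^2 q^r + \frac{1}{2} q \hat{q} + \beta j_0 m^r - m \hat{m} + \frac{1}{4} \beta^2 J^2 - \frac{1}{2} \hat{q} + \int Du\, \log 2 \cosh\left( \sqrt{\hat{q}}\, u + \hat{m} \right). \tag{5.36} $$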


The equations of state for the order parameters are determined by the saddle-point
condition. By variation of (5.36), we obtain

$$ \hat{q} = \frac{1}{2} r \beta^2 J^2 q^{r-1}, \quad \hat{m} = \beta j_0 r m^{r-1}, \tag{5.37} $$

$$ q = \int Du\, \tanh^2\left( \sqrt{\hat{q}}\, u + \hat{m} \right), \quad m = \int Du\, \tanh\left( \sqrt{\hat{q}}\, u + \hat{m} \right). \tag{5.38} $$

Eliminating $\hat{q}$ and $\hat{m}$ from (5.38) using (5.37), we can write the equations for $q$
and $m$ in closed form:

$$ q = \int Du\, \tanh^2 \beta G, \quad m = \int Du\, \tanh \beta G, \tag{5.39} $$

where

$$ G = J \sqrt{\frac{r q^{r-1}}{2}}\, u + j_0 r m^{r-1}. \tag{5.40} $$

These reduce to the equations of state for the conventional SK model, (2.28) and
(2.30), when $r = 2$. One should remember that $2 j_0$ here corresponds to $J_0$ in the
conventional notation of the SK model as is verified from (5.27) with $r = 2$.


5.4.4 Overlap 


The next task is to derive the expression of the overlap $M$. An argument similar
to §2.2.5 leads to formulae expressing the physical meaning of $q$ and $m$:

$$ q = [\langle \sigma^\alpha_i \sigma^\beta_i \rangle] = [\langle \sigma_i \rangle^2], \quad m = [\langle \sigma^\alpha_i \rangle] = [\langle \sigma_i \rangle]. \tag{5.41} $$

Comparison of (5.41) and (5.39) suggests that $\tanh^2 \beta G(\cdot)$ and $\tanh \beta G(\cdot)$ in the
integrands of the latter may be closely related to $\langle \sigma_i \rangle^2$ and $\langle \sigma_i \rangle$ in the former.
To confirm this, we add $h \sum_i \sigma^\alpha_i \sigma^\beta_i$ to the final exponent of (5.27) and follow
the calculations of the previous section to find a term $h \sigma^\alpha \sigma^\beta$ in the exponent
of the integrand of (5.35). We then differentiate $-\beta n f$ with respect to $h$ and
let $n \to 0$, $h \to 0$ to find that $\sigma^\alpha$ and $\sigma^\beta$ are singled out in the trace operation
of replica spins, leading to the factor $\tanh^2 \beta G(\cdot)$. A similar argument using an
external field term $h \sum_i \sigma^\alpha_i$ and differentiation by $h$ leads to $\tanh \beta G(\cdot)$.
It should now be clear that the additional external field with the product of
$k$ spins, $h \sum_i \sigma^\alpha_i \sigma^\beta_i \cdots$, yields

$$ [\langle \sigma_i \rangle^k_\beta] = \int Du\, \tanh^k \beta G. \tag{5.42} $$

Thus, for an arbitrary function $F(x)$ that can be expanded around $x = 0$, the
following identity holds:

$$ [F(\langle \sigma_i \rangle_\beta)] = \int Du\, F(\tanh \beta G). \tag{5.43} $$

The overlap is, in the ferromagnetic gauge, $M(\beta) = [\mathrm{sgn}\langle \sigma_i \rangle_\beta]$. If we therefore
take as $F(x)$ a function that approaches $\mathrm{sgn}(x)$ (e.g. $\tanh(ax)$ with $a \to \infty$), we
obtain the desired relation for the overlap:

$$ M(\beta) = [\mathrm{sgn}\langle \sigma_i \rangle_\beta] = \int Du\, \mathrm{sgn}\, G. \tag{5.44} $$

It has thus been established that $M(\beta)$ is determined as a function of $q$ and $m$
through $G$.
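
The replica-symmetric equations (5.39) and (5.40) and the overlap (5.44) lend themselves to a simple numerical treatment. The sketch below is one possible approach, not the procedure used in the text: it iterates the equations of state to a fixed point, evaluating the Gaussian integrals by Gauss-Hermite quadrature, and uses the fact that $\int Du\, \mathrm{sgn}(au + b) = \mathrm{erf}(b / (a\sqrt{2}))$ for $a > 0$. The starting values, node count, and tolerance are arbitrary assumptions, and plain iteration finds only the solution in whose basin the starting point lies.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Nodes and weights for the measure Du = exp(-u^2 / 2) du / sqrt(2 pi)
_U, _W = hermegauss(80)
_W = _W / math.sqrt(2.0 * math.pi)

def solve_rs(beta, J, j0, r, q=0.9, m=0.9, iterations=2000, tol=1e-12):
    """Iterate the RS equations of state (5.39)-(5.40) for integer r >= 2."""
    for _ in range(iterations):
        G = J * math.sqrt(r * q ** (r - 1) / 2.0) * _U + j0 * r * m ** (r - 1)
        t = np.tanh(beta * G)
        q_new, m_new = float(_W @ t**2), float(_W @ t)
        converged = abs(q_new - q) + abs(m_new - m) < tol
        q, m = q_new, m_new
        if converged:
            break
    return q, m

def overlap_rs(J, j0, r, q, m):
    """Overlap (5.44), M = int Du sgn G, written as an error function."""
    mean = j0 * r * m ** (r - 1)
    width = J * math.sqrt(r * q ** (r - 1) / 2.0)
    if width == 0.0:
        return float(np.sign(mean))
    return math.erf(mean / (width * math.sqrt(2.0)))
```

Scanning `beta` at fixed `j0` and feeding each solution into `overlap_rs` gives the replica-symmetric estimate of $M(\beta)$; the result is meaningful only where the RS solution is stable.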


5.5 Replica symmetry breaking 

The system (5.25) and (5.26) in the ferromagnetic gauge is a spin glass model
with $r$-body interactions. It is natural to go further to investigate the properties
of the RSB solution (Derrida 1981; Gross and Mézard 1984; Gardner 1985;
Nishimori and Wong 1999; Gillin et al. 2001). We shall show that, for small values
of the centre of distribution $j_0$, a 1RSB phase appears after the RS paramagnetic
phase as the temperature is lowered. A full RSB phase follows at still lower
temperature. For larger $j_0$, the paramagnetic phase, the ferromagnetic phase without
RSB, and then the ferromagnetic phase with RSB appear sequentially as the
temperature is decreased.


5.5.1 First-step RSB 


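The free energy with 1RSB can be derived following the method described in
§3.2:

$$
\begin{aligned}
-\beta f = {} & -\hat{m} m + \frac{1}{2} x \hat{q}_0 q_0 + \frac{1}{2} (1 - x) \hat{q}_1 q_1 + \beta j_0 m^r \\
& - \frac{1}{4} x \beta^2 J^2 q_0^r - \frac{1}{4} (1 - x) \beta^2 J^2 q_1^r + \frac{1}{4} \beta^2 J^2 - \frac{1}{2} \hat{q}_1 \\
& + \frac{1}{x} \int Du\, \log \int Dv\, \cosh^x\left( \hat{m} + \sqrt{\hat{q}_0}\, u + \sqrt{\hat{q}_1 - \hat{q}_0}\, v \right) + \log 2,
\end{aligned}
\tag{5.45}
$$

where $x$ ($0 \le x \le 1$) is the boundary of $q_0$ and $q_1$ in the matrix block, denoted
by $m_1$ in §3.2. Extremization of (5.45) with respect to $q_0, q_1, \hat{q}_0, \hat{q}_1, m, \hat{m}, x$ leads
to the equations of state. For $\hat{m}, \hat{q}_0, \hat{q}_1$,

$$ \hat{m} = \beta j_0 r m^{r-1}, \quad \hat{q}_0 = \frac{1}{2} \beta^2 J^2 r q_0^{r-1}, \quad \hat{q}_1 = \frac{1}{2} \beta^2 J^2 r q_1^{r-1}. \tag{5.46} $$

Elimination of $\hat{m}, \hat{q}_0, \hat{q}_1$ using (5.46) from the equations of extremization with
respect to $\hat{q}_0, \hat{q}_1, \hat{m}$ leads to

$$ m = \int Du\, \frac{\int Dv\, \cosh^x \beta G_1 \tanh \beta G_1}{\int Dv\, \cosh^x \beta G_1}, \tag{5.47} $$

$$ q_0 = \int Du \left( \frac{\int Dv\, \cosh^x \beta G_1 \tanh \beta G_1}{\int Dv\, \cosh^x \beta G_1} \right)^2, \tag{5.48} $$

$$ q_1 = \int Du\, \frac{\int Dv\, \cosh^x \beta G_1 \tanh^2 \beta G_1}{\int Dv\, \cosh^x \beta G_1}, \tag{5.49} $$

$$ G_1 = J \sqrt{\frac{r}{2} q_0^{r-1}}\, u + J \sqrt{\frac{r}{2} \left( q_1^{r-1} - q_0^{r-1} \right)}\, v + j_0 r m^{r-1}. \tag{5.50} $$

These equations coincide with (3.32)-(3.34) for $r = 2$ if we set $h = 0$ and $2 j_0 = J_0$.
The equation of extremization by $x$ does not have an intuitively appealing
compact form so that we omit it here.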

The AT stability condition of the RS solution is

$$ \frac{2 T^2 q^{2-r}}{r(r-1)} > J^2 \int Du\, \mathrm{sech}^4 \beta G. \tag{5.51} $$

The criterion of stability for 1RSB is expressed similarly. For the replica pair
$(\alpha\beta)$ in the same diagonal block, the stability condition for small deviations of
$q_{\alpha\beta}$ and $\hat{q}_{\alpha\beta}$ from 1RSB is

$$ \frac{2 T^2 q_1^{2-r}}{r(r-1)} > J^2 \int Du\, \frac{\int Dv\, \cosh^{x-4} \beta G_1}{\int Dv\, \cosh^x \beta G_1}. \tag{5.52} $$

Further steps of RSB take place with the diagonal blocks breaking up into smaller
diagonal and off-diagonal blocks. Thus it is sufficient to check the intra-block
stability condition (5.52) only.


5.5.2 Random energy model 
The model in the limit $r \to \infty$ is known as the random energy model (REM).
In this limit the spin glass phase is characterized by 1RSB.

As its name suggests, the REM has an independent distribution of energy.
Let us demonstrate this fact for the case of $j_0 = 0$. The probability that the
system has energy $E$ will be denoted by $P(E)$,

$$ P(E) = [\delta(E - H(\sigma))]. \tag{5.53} $$

The average $[\cdots]$ over the distribution of $J$, (5.26), can be carried out if we
express the delta function by Fourier transformation. The result is

$$ P(E) = \frac{1}{\sqrt{N \pi J^2}} \exp\left\{ -\frac{E^2}{J^2 N} \right\}. \tag{5.54} $$


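The simultaneous distribution function of the energy values $E_1$ and $E_2$ of two
independent spin configurations $\sigma^{(1)}$ and $\sigma^{(2)}$ with the same set of interactions
can be derived similarly:

$$
\begin{aligned}
P(E_1, E_2) &= [\delta(E_1 - H(\sigma^{(1)}))\, \delta(E_2 - H(\sigma^{(2)}))] \\
&= \frac{1}{N \pi J^2 \sqrt{(1 + q^r)(1 - q^r)}} \exp\left\{ -\frac{(E_1 + E_2)^2}{2N(1 + q^r) J^2} - \frac{(E_1 - E_2)^2}{2N(1 - q^r) J^2} \right\}, \\
q &= \frac{1}{N} \sum_i \sigma^{(1)}_i \sigma^{(2)}_i.
\end{aligned}
\tag{5.55}
$$

It is easy to see in the limit $r \to \infty$ that

$$ P(E_1, E_2) \to P(E_1) P(E_2), \tag{5.56} $$

which implies independence of the energy distributions of two spin configurations.
Similar arguments hold for three (and more) energy values.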

The number of states with energy $E$ is, according to the independence of
energy levels,

$$ n(E) = 2^N P(E) = \frac{1}{\sqrt{\pi N J^2}} \exp N \left\{ \log 2 - \left( \frac{E}{NJ} \right)^2 \right\}. \tag{5.57} $$

This expression shows that there are very many energy levels for $|E| < NJ\sqrt{\log 2}
\equiv E_0$ but none in the other range $|E| > E_0$ in the limit $N \to \infty$. The entropy
for $|E| < E_0$ is

$$ S(E) = N \left[ \log 2 - \left( \frac{E}{NJ} \right)^2 \right]. \tag{5.58} $$

We then have, from $dS/dE = 1/T$,

$$ E = -\frac{N J^2}{2T}. \tag{5.59} $$

The free energy is therefore

$$ f = \begin{cases} -T \log 2 - \dfrac{J^2}{4T} & (T > T_c) \\[2mm] -J \sqrt{\log 2} & (T < T_c), \end{cases} \tag{5.60} $$

where $T_c/J = (2\sqrt{\log 2})^{-1}$. Equation (5.60) indicates that there is a phase
transition at $T = T_c$ and the system freezes out completely ($S = 0$) below the
transition temperature.
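
The REM thermodynamics of (5.58)-(5.60) is simple enough to tabulate directly. The following sketch, with illustrative function names and all quantities taken per spin, evaluates the free energy and entropy and can be used to confirm that both are continuous at $T_c$ and that the entropy vanishes there.

```python
import math

def rem_critical_temperature(J: float = 1.0) -> float:
    """T_c = J / (2 sqrt(log 2)), the temperature at which E reaches -E_0."""
    return J / (2.0 * math.sqrt(math.log(2.0)))

def rem_free_energy(T: float, J: float = 1.0) -> float:
    """Free energy per spin (5.60) of the REM with j_0 = 0."""
    if T > rem_critical_temperature(J):
        return -T * math.log(2.0) - J**2 / (4.0 * T)
    return -J * math.sqrt(math.log(2.0))

def rem_entropy(T: float, J: float = 1.0) -> float:
    """Entropy per spin from (5.58) and (5.59); zero in the frozen phase."""
    if T > rem_critical_temperature(J):
        return math.log(2.0) - (J / (2.0 * T)) ** 2
    return 0.0
```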

5.5.3 Replica solution in the limit $r \to \infty$


It is instructive to rederive the results of the previous subsection by the replica
method. We first discuss the case $j_0 = 0$. It is quite reasonable to expect no RSB
in the paramagnetic (P) phase at high temperature. We thus set $q = \hat{q} = m =
\hat{m} = 0$ in (5.36) to obtain

$$ f_P = -T \log 2 - \frac{J^2}{4T}. \tag{5.61} $$

This agrees with (5.60) for $T > T_c$.

It is necessary to introduce RSB in the spin glass (SG) phase. We try 1RSB
and confirm that the result agrees with that of the previous subsection. For the
1RSB to be non-trivial (i.e. different from the RS), it is required that $q_0 < q_1 \le 1$
and $\hat{q}_0 < \hat{q}_1$. Then, if $q_1 < 1$, we find $\hat{q}_0 = \hat{q}_1 = 0$ in the limit $r \to \infty$ from (5.46).
We therefore have $q_1 = 1$, and $\hat{q}_1 = \beta^2 J^2 r/2$ from (5.46). Then, in (5.48), we find
$G_1 = J \sqrt{r/2}\, v$ for $r \gg 1$ and the $v$-integral in the numerator vanishes, leading
to $q_0 = 0$. Hence $\hat{q}_0 = 0$ from (5.46). From these results, the free energy (5.45)
in the limit $r \to \infty$ is

$$ -\beta f = \frac{\beta^2 J^2}{4} x + \frac{1}{x} \log 2. \tag{5.62} $$

Variation with respect to $x$ gives

$$ (x \beta J)^2 = 4 \log 2. \tag{5.63} $$

The highest temperature satisfying this equation is the following one for $x = 1$:

$$ \frac{T_c}{J} = \frac{1}{2\sqrt{\log 2}}, \tag{5.64} $$

and therefore, for $T < T_c$,

$$ f_{SG} = -J \sqrt{\log 2}. \tag{5.65} $$

This agrees with (5.60), which confirms that 1RSB is exact in the temperature
range $T < T_c$. It is also possible to show by explicit $k$-step RSB calculations that
the solution reduces to the 1RSB for any $k$ ($\ge 1$) (Gross and Mézard 1984).

It is easy to confirm that $x < 1$ for $T < T_c$ from (5.63); generally, $x = T/T_c$.
The order parameter function $q(x)$ is equal to 1 ($= q_1$) above $x$ ($= T/T_c$) and
0 ($= q_0$) below $x$ (see Fig. 5.3).
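
The steps from the 1RSB free energy in the limit $r \to \infty$, $-\beta f = \beta^2 J^2 x / 4 + x^{-1} \log 2$, to the frozen free energy can be spelled out as follows. This is a short worked sketch of the extremization with respect to $x$, using only relations already stated in this subsection:

```latex
\frac{\partial (-\beta f)}{\partial x}
  = \frac{\beta^2 J^2}{4} - \frac{\log 2}{x^2} = 0
\;\Longrightarrow\;
x = \frac{2\sqrt{\log 2}}{\beta J} = \frac{T}{T_c},
\qquad
-\beta f \Big|_{x = T/T_c}
  = \frac{\beta J \sqrt{\log 2}}{2} + \frac{\beta J \sqrt{\log 2}}{2}
  = \beta J \sqrt{\log 2}
\;\Longrightarrow\;
f = -J\sqrt{\log 2}.
```

The free energy is thus independent of temperature below $T_c$, consistent with a vanishing entropy in the frozen phase.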

For $j_0$ exceeding some critical value, a ferromagnetic (F) phase exists. It is
easy to confirm that the solution of (5.48), (5.49), and (5.47) in the limit $r \to \infty$
is $q_0 = q_1 = m = 1$ when $j_0 > 0$, $m > 0$. The ferromagnetic phase is therefore
replica symmetric since $q_0 = q_1$. The free energy (5.36) is then

$$ f_F = -j_0. \tag{5.66} $$


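The phase boundaries between these three phases are obtained by comparison
of the free energies (5.61), (5.65), and (5.66):

1. P-SG boundary: $T_c/J = (2\sqrt{\log 2})^{-1}$.
2. SG-F boundary: $(j_0)_c/J = \sqrt{\log 2}$.
3. P-F boundary: $j_0 = J^2/(4T) + T \log 2$.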

FIG. 5.3. Spin glass order parameter of the REM below the transition temperature
(plot of $q(x)$ against $x$, with the step located at $x = T/T_c$).

FIG. 5.4. Phase diagram of the REM (axes: $k_B T/J$ against $j_0$, with $\sqrt{\log 2}$
marked on the horizontal axis). The Nishimori line is shown dashed.


The final phase diagram is depicted in Fig. 5.4. 

Let us now turn to the interpretation of the above results in terms of
error-correcting codes. The overlap $M$ is one in the ferromagnetic phase ($j_0/J >
\sqrt{\log 2}$) because the spin alignment is perfect ($m = 1$), implying error-free
decoding.* To see the relation of this result to the Shannon bound (5.3), we first
note that the transmission rate of information by the Sourlas code is

$$ R = \frac{N}{\binom{N}{r}}. \tag{5.67} $$

As shown in Appendix C, the capacity of the Gaussian channel is

$$ C = \frac{1}{2} \log_2 \left( 1 + \frac{J_0^2}{J^2} \right). \tag{5.68} $$

Here we substitute $J_0 = j_0 r!/N^{r-1}$ and $J^2 \to J^2 r!/2N^{r-1}$ according to (5.26)
and take the limit $N \gg 1$ with $r$ fixed to find

$$ C \approx \frac{j_0^2\, r!}{J^2 N^{r-1} \log 2}. \tag{5.69} $$

The transmission rate (5.67), on the other hand, reduces in the same limit to

$$ R \approx \frac{r!}{N^{r-1}}. \tag{5.70} $$

It has thus been established that the transmission rate $R$ coincides with the
channel capacity $C$ at the lower limit of the ferromagnetic phase $j_0/J = \sqrt{\log 2}$.
In the context of error-correcting codes, $j_0$ represents the signal amplitude and $J$
is the amplitude of the noise, and hence $j_0/J$ corresponds to the S/N ratio. The
conclusion is that the Sourlas code in the limit $r \to \infty$, equivalent to the REM,
is capable of error-free decoding ($m = 1$, $M = 1$) for the S/N ratio exceeding
some critical value and the Shannon bound is achieved at this critical value.

* Remember that we are using the ferromagnetic gauge.
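
The coincidence of $R$ and $C$ at $j_0/J = \sqrt{\log 2}$ can be checked numerically for large but finite $N$. The sketch below evaluates the rate of the Sourlas code and the Gaussian-channel capacity with the scaled signal amplitude and noise variance quoted above; the function names are illustrative, and `math.log1p` is used only because the signal-to-noise ratio per bond is extremely small.

```python
import math

def sourlas_rate(N: int, r: int) -> float:
    """Transmission rate: N message bits over binom(N, r) transmitted bonds."""
    return N / math.comb(N, r)

def gaussian_capacity(N: int, r: int, j0: float, J: float) -> float:
    """Gaussian-channel capacity in bits with the N- and r-dependent scaling."""
    signal = j0 * math.factorial(r) / N ** (r - 1)              # J_0
    variance = J**2 * math.factorial(r) / (2 * N ** (r - 1))    # scaled J^2
    return 0.5 * math.log1p(signal**2 / variance) / math.log(2.0)
```

For example, with $N = 10^6$ and $r = 3$ the ratio `sourlas_rate(N, r) / gaussian_capacity(N, r, j0, J)` is close to one at $j_0/J = \sqrt{\log 2}$, smaller than one for larger $j_0/J$, and larger than one for smaller $j_0/J$.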

The general inequality (5.21) is of course satisfied. Both sides vanish if $j_0 <
(j_0)_c$. For $j_0 > (j_0)_c$, the right hand side is one while the left hand side is zero
in the paramagnetic phase and one in the ferromagnetic phase. In other words,
the Sourlas code in the limit $r \to \infty$ makes it possible to transmit information
without errors under the MAP as well as under the MPM. An important point is
that the information transmission rate $R$ is vanishingly small, impeding practical
usefulness of this code.

The Nishimori line $\beta J^2 = J_0$ is in the present case $T/J = J/(2 j_0)$ and passes
through the point at $j_0/J = \sqrt{\log 2}$ and $T/J = 1/(2\sqrt{\log 2})$ where three phases
(P, SG, F) coexist. The exact energy on it, $E = -j_0$, derived from the gauge
theory agrees with the above answer (5.66). One should remember here that the
free energy coincides with the energy as the entropy vanishes.


5.5.4 Solution for finite r 


It is necessary to solve the equations of state numerically for general finite r.⁷
The result for the case of r = 3 is shown here as an example (Nishimori and Wong
1999; Gillin et al. 2001). If j_0 is close to zero, one finds a 1RSB SG solution with
q_1 > 0 and q_0 = m = 0 below T = 0.651J. As the temperature is lowered, the
stability condition of the 1RSB (5.52) breaks down at T = 0.240J, and the full
RSB takes over.

The ferromagnetic phase is RS (5.39) in the high-temperature range but 
the RSB should be taken into account below the AT line (3.22); a mixed (M) 
phase with both ferromagnetic order and RSB exists at low temperatures. The 


7 Expansions from the large-r limit and from r = 2 are also possible (Gardner 1985).

Fig. 5.5. Phase diagram of the model with r = 3. The double dotted line
indicates the limit of metastability (spinodal) of the ferromagnetic phase.
Error correction is possible to the right of this boundary. Thermodynamic
phase boundaries are drawn in full lines. The Nishimori line is drawn dashed.

Fig. 5.6. Overlap for r = 3, j_0 = 0.77.


ferromagnetic phase, with RS and/or RSB, continues to exist beyond the limit 
of thermodynamic stability as a metastable state (i.e. as a local minimum of the 
free energy). Figure 5.5 summarizes the result. 

Dependence of the overlap M(β) on T (= 1/β) is depicted in Fig. 5.6 where
j_0/J is fixed to 0.77 close to the boundary of the ferromagnetic phase. The
overlap M(β) is a maximum at the optimal temperature T/J = J/2j_0 = 0.649
appearing on the right hand side of (5.24) corresponding to the Nishimori line
(the dot in Fig. 5.6). The ferromagnetic phase disappears above T/J = 0.95 even
as a metastable state and one has M = 0: the paramagnetic phase has ⟨σ_i⟩_β = 0
at each site i, and sgn⟨σ_i⟩_β cannot be defined. It is impossible to decode the
message there. For temperatures below T/J = 0.43, the RSB is observed (the
dashed part). We have thus clarified in this example how much the decoded
message agrees with the original one as the temperature is changed around the
optimal value.


5.6 Codes with finite connectivity 


The Sourlas code saturates the Shannon bound asymptotically with vanishing 
transmission rate. A mean-field model with finite connectivity has a more desir- 
able property that the rate is finite yet the Shannon bound is achieved. We state 
some of the important results about this model in the present section. We refer 
the reader to the original papers cited in the text for details of the calculations. 


5.6.1 Sourlas-type code with finite connectivity 


The starting point is analogous to the Sourlas code described by the Hamiltonian 
(5.25) but with diluted binary interactions for the BSC, 


    H = -\sum_{i_1 < \cdots < i_r} A_{i_1 \ldots i_r} J_{i_1 \ldots i_r} \sigma_{i_1} \cdots \sigma_{i_r} - F \sum_i \sigma_i,    (5.71)


where the element of the symmetric tensor A_{i_1...i_r} (representing dilution) is either
zero or one depending on the set of indices (i_1, i_2, ..., i_r). The final term has
been added to be prepared for biased messages in which 1 may appear more
frequently than −1 (or vice versa). The connectivity is c; there are c non-zero
elements randomly chosen for any given site index i:

    \sum_{i_2, \ldots, i_r} A_{i \, i_2 \ldots i_r} = c.    (5.72)


The code rate is R = r/c because an encoded message has c bits per index i and 
carries r bits of the original information. 
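
The constraint (5.72) and the rate R = r/c can be made concrete with a small construction. The sketch below is an assumption-laden illustration rather than the construction used in the cited papers: it draws the non-zero elements of A by shuffling c copies of every site index and cutting the result into groups of r, rejecting draws with a repeated site or a repeated group. The function name and parameters are hypothetical.

```python
import random

def random_regular_checks(N, r, c, seed=0, max_tries=5000):
    """Return N*c/r index sets of size r; every site appears in exactly c sets."""
    if (N * c) % r != 0:
        raise ValueError("N*c must be divisible by r")
    rng = random.Random(seed)
    for _ in range(max_tries):
        stubs = [i for i in range(N) for _ in range(c)]   # c copies of each site
        rng.shuffle(stubs)
        checks = [tuple(sorted(stubs[k:k + r])) for k in range(0, len(stubs), r)]
        no_repeated_site = all(len(set(chk)) == r for chk in checks)
        no_repeated_check = len(set(checks)) == len(checks)
        if no_repeated_site and no_repeated_check:
            return checks
    raise RuntimeError("no valid configuration found; increase max_tries")

# Toy sizes chosen for illustration only.
N, r, c = 24, 2, 4
checks = random_regular_checks(N, r, c, seed=1)
rate = N / len(checks)            # message bits per transmitted coupling
print(len(checks), rate, r / c)
```

Each returned tuple marks one non-zero element of A, so the number of transmitted couplings is Nc/r and the rate is r/c as stated above.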

Using the methods developed for diluted spin glasses (Wong and Sherrington 
1988), one can calculate the free energy under the RS ansatz as (Kabashima and 
Saad 1998, 1999; Vicente et al. 1999) 


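    -\beta f^{(RS)} = \frac{c}{r} \log\cosh\beta + \frac{c}{r} \int \prod_{l=1}^{r} dx_l \, \pi(x_l) \left[ \log\left( 1 + \tanh\beta J \prod_{j=1}^{r} \tanh\beta x_j \right) \right]_J
        - c \int dx \, dy \, \pi(x) \hat{\pi}(y) \log(1 + \tanh\beta x \tanh\beta y) - c \int dy \, \hat{\pi}(y) \log\cosh\beta y
        + \int \prod_{l=1}^{c} dy_l \, \hat{\pi}(y_l) \left[ \log\left( 2\cosh\left( \beta \sum_{j=1}^{c} y_j + \beta F \xi \right) \right) \right]_\xi.    (5.73)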


Here [···]_J and [···]_ξ denote the configurational averages over the distributions of
J and ξ, respectively. The order functions π(x) and π̂(y) represent distributions
of the multi-replica spin overlap and its conjugate:
Fig. 5.7. Finite-temperature phase diagram of the unbiased diluted Sourlas
code with R = 1/4 in the limit r, c → ∞ (Vicente et al. 1999). The Shannon
bound is achieved at p_c. The ferromagnetic phase is stable at least above the
dashed line. (Copyright 1999 by the American Physical Society)


    q_{\alpha\beta\ldots\gamma} = a \int dx \, \pi(x) \tanh^l \beta x, \qquad \hat{q}_{\alpha\beta\ldots\gamma} = \hat{a} \int dy \, \hat{\pi}(y) \tanh^l \beta y,    (5.74)


where a and â are normalization constants, and l is the number of replica indices
on the left hand side. Extremization of the free energy gives paramagnetic and
ferromagnetic solutions for the order functions. The spin glass solution should be
treated with more care under the 1RSB scheme. The result becomes relatively
simple in the limit where r and c tend to infinity with the ratio R = r/c (≡ 1/α)
kept finite and F = 0:


    f_P = -T (\alpha \log\cosh\beta + \log 2)
    f_F = -\alpha (1 - 2p)    (5.75)
    f_{1RSB-SG} = -T_g (\alpha \log\cosh\beta_g + \log 2),


where p is the noise probability of the BSC, and T_g is determined by the condition
of vanishing paramagnetic entropy, α(log cosh β_g − β_g tanh β_g) + log 2 = 0.


The finite-temperature phase diagram for a given R is depicted in Fig. 5.7. Perfect
decoding (m = M = 1) is possible in the ferromagnetic phase that extends to
the limit p_c. It can be verified by equating f_F and f_{1RSB-SG} that the Shannon
bound is achieved at p_c,


    R = 1 + p \log_2 p + (1 - p) \log_2 (1 - p) \quad (p = p_c).    (5.76)
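
The statement can be checked numerically. The sketch below is an illustration under stated assumptions (bisection solvers and function names are mine, not the authors'): it solves (5.76) for p_c at a given rate, solves the zero-entropy condition below (5.75) for β_g, and compares tanh β_g with 1 − 2p_c, which is what equating f_F and f_{1RSB-SG} requires once the entropy condition is used.

```python
import math

def binary_entropy(p: float) -> float:
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def critical_noise(R: float, tol: float = 1e-12) -> float:
    """Solve R = 1 + p log2 p + (1-p) log2 (1-p) for p in (0, 1/2), eq. (5.76)."""
    lo, hi = 1e-15, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if 1 - binary_entropy(mid) > R:
            lo = mid          # bound still above R: more noise can be tolerated
        else:
            hi = mid
    return 0.5 * (lo + hi)

def glass_inverse_temperature(R: float, tol: float = 1e-12) -> float:
    """Solve alpha (log cosh b - b tanh b) + log 2 = 0 for b, with alpha = 1/R."""
    alpha = 1.0 / R
    def entropy(b):
        return alpha * (math.log(math.cosh(b)) - b * math.tanh(b)) + math.log(2)
    lo, hi = 0.0, 50.0        # entropy is positive at 0 and negative at 50 for R < 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if entropy(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

R = 0.5
p_c = critical_noise(R)
beta_g = glass_inverse_temperature(R)
print(p_c, math.tanh(beta_g), 1 - 2 * p_c)
```

With the entropy condition, f_{1RSB-SG} reduces to −α tanh β_g, so the two free energies cross exactly where tanh β_g = 1 − 2p.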
Fig. 5.8. Ground-state magnetization as a function of the noise probability for
various r (written as K in the figure) (Vicente et al. 1999). The rate R is 
1/2 and F = 0. Also shown by open circles are the numerical results from 
the TAP-like decoding algorithm. (Copyright 1999 by the American Physical 
Society) 


The ferromagnetic solution appears as a metastable state at a very high tem- 
perature of O(r/logr), but the thermodynamic transition takes place at T of 
O(1). This suggests that there exists a high energy barrier between the ferromag- 
netic and paramagnetic solutions. Consequently, it might be difficult to reach the 
correctly decoded (i.e. ferromagnetic) state starting from an arbitrary initial con- 
dition (which is almost surely a paramagnetic state) by some decoding algorithm. 
We are therefore led to consider moderate r, in which case the ferromagnetic
phase would have a larger basin of attraction although we have to sacrifice the 
final quality of the decoded result (magnetization smaller than unity). In Fig. 
5.8 the ground-state magnetization (overlap) is shown as a function of the noise 
probability for various finite values of r (written as K in the figure) in the case 
of R = 1/2. The transition is of first order except for r = 2. It can be seen that 
the decoded result is very good (m close to one) for moderate values of r and p. 

It is useful to devise a practical algorithm of decoding, given the channel
output {J_μ}, where μ denotes an appropriate combination of site indices. The
following method based on an iterative solution of TAP-like equations is a powerful
tool for this purpose (Kabashima and Saad 2001; Saad et al. 2001) since its
computational requirement is only of O(N). For a given site i and an interaction
μ that includes i, one considers a set of conditional probabilities


    P(\sigma_i | \{J_{\nu \neq \mu}\}) = \frac{1 + m_{\mu i} \sigma_i}{2}
    P(J_\mu | \sigma_i, \{J_{\nu \neq \mu}\}) = \mathrm{const} \cdot (1 + \hat{m}_{\mu i} \sigma_i),    (5.77)


where ν also includes i.⁸ Under an approximate mean-field-like decoupling of the
conditional probabilities, one obtains the following set of equations for m_{μi} and
m̂_{μi}:


    m_{\mu i} = \tanh\left( \sum_{\nu \in M(i) \setminus \mu} \tanh^{-1} \hat{m}_{\nu i} + \beta F \right)
    \hat{m}_{\mu i} = \tanh\beta J_\mu \cdot \prod_{l \in L(\mu) \setminus i} m_{\mu l},    (5.78)


where M(i) is a set of interactions that include i, and L(μ) is a set of sites
connected by J_μ. After iteratively solving these equations for m_{μi} and m̂_{μi}, one
determines the final decoded result of the ith bit as sgn(m_i), where


    m_i = \tanh\left( \sum_{\nu \in M(i)} \tanh^{-1} \hat{m}_{\nu i} + \beta F \right).    (5.79)


8 We did not write out the normalization constant in the second expression of (5.77) because
the left hand side is to be normalized with respect to J_μ in contrast to the first expression to
be normalized for σ_i.

This method is equivalent to the technique of belief propagation used in
information theory. It is also called a TAP approach in the statistical mechanics
literature owing to its similarity to the TAP equations in the sense that m_{μi} and
m̂_{μi} reflect the effects of removal of a bond μ from the system.
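
A minimal sketch of the iteration (5.78) and the decision rule (5.79) is given below. It is an illustration under stated assumptions, not the implementation of the cited papers: the data layout, the function name bp_decode, the clipping constant, and the init argument are all hypothetical. For even r and F = 0 the equations are invariant under a global spin flip, so the sign of the solution is fixed here by a weak initial bias; in this toy check the bias is aligned with the true message purely to make the outcome deterministic, and the channel is noise free.

```python
import math

def bp_decode(checks, J, beta, F=0.0, n_iter=30, init=None):
    """Iterate (5.78) on m[(mu, i)] and mhat[(mu, i)]; return sgn(m_i) from (5.79).

    checks : list of tuples of site indices, the sets L(mu)
    J      : list of received couplings J_mu (+1 or -1)
    init   : optional initial value of m_{mu i} for each site i (default 0.1)
    """
    N = 1 + max(i for chk in checks for i in chk)
    init = [0.1] * N if init is None else init
    members = [[] for _ in range(N)]            # M(i): interactions containing i
    for mu, chk in enumerate(checks):
        for i in chk:
            members[i].append(mu)
    m = {(mu, i): init[i] for mu, chk in enumerate(checks) for i in chk}
    mhat = {}
    clip = 1.0 - 1e-12                          # keeps atanh finite
    for _ in range(n_iter):
        # second line of (5.78)
        for mu, chk in enumerate(checks):
            for i in chk:
                prod = math.tanh(beta * J[mu])
                for l in chk:
                    if l != i:
                        prod *= m[(mu, l)]
                mhat[(mu, i)] = max(-clip, min(clip, prod))
        # first line of (5.78): leave out the contribution of mu itself
        for i in range(N):
            fields = {nu: math.atanh(mhat[(nu, i)]) for nu in members[i]}
            total = sum(fields.values()) + beta * F
            for mu in members[i]:
                m[(mu, i)] = math.tanh(total - fields[mu])
    decoded = []
    for i in range(N):                          # (5.79) followed by the sign
        h = sum(math.atanh(mhat[(nu, i)]) for nu in members[i]) + beta * F
        decoded.append(1 if h >= 0 else -1)
    return decoded

# Toy example (assumed): r = 2, c = 3 on six sites, noise-free couplings.
checks = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5), (0, 3), (1, 4), (2, 5)]
xi = [1, -1, -1, 1, -1, 1]
J = [xi[a] * xi[b] for a, b in checks]
decoded = bp_decode(checks, J, beta=1.0, init=[0.1 * x for x in xi])
print(decoded)
```

Each sweep touches every (interaction, site) pair a fixed number of times, which is the origin of the O(N) cost quoted above for fixed r and c.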

The resulting numerical data are shown in Fig. 5.8. One can see satisfactory 
agreement with the replica solution. It also turns out that the basin of attraction 
of the ferromagnetic solution is very large for r = 2 but not for r ≥ 3.


5.6.2 Low-density parity-check code 


Statistical-mechanical analysis is applicable also to other codes that are ac- 
tively investigated from the viewpoint of information theory. We explain the 
low-density parity-check code (LDPC) here because of its formal similarity to 
the diluted Sourlas code treated in the previous subsection (Kabashima et al. 
2000a; Murayama et al. 2000). In statistical mechanics terms, the LDPC is a 
diluted many-body Mattis model in an external field.⁹

Let us start the argument with the definition of the code in terms of a Boolean
representation (0 and 1, instead of ±1). The original message of length N is
denoted by an N-dimensional Boolean vector ξ and the encoded message of
length M by z_0. The latter is generated from the former using two sparse matrices
C_s and C_n according to the following modulo-2 operation of Boolean numbers:


    z_0 = C_n^{-1} C_s \xi.    (5.80)


The matrix C_s has the size M × N and the number of ones per row is r and that
per column is c, located at random. Similarly, C_n is M × M and has l ones per
row and column randomly.

9 There are several variations of the LDPC. We treat in this section the one discussed by
MacKay and Neal (1997) and MacKay (1999). See also Vicente et al. (2000) for a slightly
different code.

The channel noise ζ is added to z_0, and the output is

    z = z_0 + \zeta.    (5.81)


Decoding is carried out by multiplying z by C_n:

    C_n z = C_n z_0 + C_n \zeta = C_s \xi + C_n \zeta.    (5.82)


One finds the most probable solution of this equation for the decoded message
σ and the inferred noise τ:


    C_s \sigma + C_n \tau = C_s \xi + C_n \zeta.    (5.83)
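
The modulo-2 relations (5.80) to (5.82) can be traced on a toy example. The sketch below is illustrative only: the matrices are tiny hand-picked examples rather than random sparse ones, and the helper names are assumptions. C_n is chosen with l = 3 ones per row and column because a Boolean matrix with an even number of ones in every row annihilates the all-ones vector and is therefore not invertible modulo 2.

```python
def gf2_matvec(A, v):
    """Matrix-vector product modulo 2."""
    return [sum(a & b for a, b in zip(row, v)) % 2 for row in A]

def gf2_solve(A, b):
    """Solve A x = b modulo 2 for a square invertible A (Gauss-Jordan)."""
    n = len(A)
    aug = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = next((k for k in range(col, n) if aug[k][col]), None)
        if pivot is None:
            raise ValueError("matrix is singular modulo 2")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(n):
            if k != col and aug[k][col]:
                aug[k] = [x ^ y for x, y in zip(aug[k], aug[col])]
    return [row[n] for row in aug]

# Toy matrices (assumed): M = 4, N = 2, r = 1, c = 2, l = 3.
C_s = [[1, 0], [0, 1], [1, 0], [0, 1]]
C_n = [[0, 1, 1, 1], [1, 0, 1, 1], [1, 1, 0, 1], [1, 1, 1, 0]]
xi = [1, 0]                                   # original message
zeta = [0, 1, 0, 0]                           # channel noise

z0 = gf2_solve(C_n, gf2_matvec(C_s, xi))      # (5.80): z0 = C_n^{-1} C_s xi
z = [(a + b) % 2 for a, b in zip(z0, zeta)]   # (5.81)
lhs = gf2_matvec(C_n, z)                      # (5.82), left hand side
rhs = [(a + b) % 2 for a, b in zip(gf2_matvec(C_s, xi), gf2_matvec(C_n, zeta))]
print(z0, z, lhs, rhs)
```

The receiver knows only lhs; decoding amounts to finding the most probable pair (σ, τ) that reproduces it, as expressed by (5.83).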


The Ising spin representation corresponding to the above prescription of the
LDPC, in particular (5.83), is


    \prod_{i \in L_s(\mu)} \sigma_i \prod_{j \in L_n(\mu)} \tau_j = \prod_{i \in L_s(\mu)} \xi_i \prod_{j \in L_n(\mu)} \zeta_j \quad (\equiv J_\mu),    (5.84)


where L_s(μ) is a set of indices of non-zero elements in the μth row of C_s and
similarly for L_n(μ). Note that σ, τ, ξ, and ζ are all Ising variables (±1) from
now on. The Hamiltonian reflects the constraint (5.84) as well as the bias in the
original message and the channel noise:


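    H = \sum A_{i_1 \ldots i_r; j_1 \ldots j_l} \, \delta[-1, J_{i_1 \ldots i_r; j_1 \ldots j_l} \sigma_{i_1} \cdots \sigma_{i_r} \tau_{j_1} \cdots \tau_{j_l}] - T F_s \sum_i \sigma_i - T F_n \sum_j \tau_j.    (5.85)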


Here A is a sparse tensor for choosing the appropriate combination of indices
corresponding to C_s and C_n (or L_s(μ) and L_n(μ)), F_s is the bias of the original
message, and F_n = ½ log((1 − p)/p) comes from the channel noise of rate p. The
interaction J_{i_1...i_r; j_1...j_l} is specified by the expressions involving ξ and ζ in (5.84).
The problem is to find the ground state of this Hamiltonian to satisfy (5.84),
given the output of the channel {J_μ} defined in (5.84).

The replica analysis of the present system works similarly to the diluted 
Sourlas code. The resulting RS free energy at T = 0 is 


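    f = \frac{c}{r} \log 2 + c \int dx \, d\hat{x} \, \pi(x) \hat{\pi}(\hat{x}) \log(1 + x\hat{x}) + \frac{cl}{r} \int dy \, d\hat{y} \, \rho(y) \hat{\rho}(\hat{y}) \log(1 + y\hat{y})
        - \frac{c}{r} \int \prod_{k=1}^{r} dx_k \, \pi(x_k) \prod_{m=1}^{l} dy_m \, \rho(y_m) \log\left( 1 + \prod_k x_k \prod_m y_m \right)
        - \int \prod_{k=1}^{c} d\hat{x}_k \, \hat{\pi}(\hat{x}_k) \left[ \log\left( e^{F_s \xi} \prod_k (1 + \hat{x}_k) + e^{-F_s \xi} \prod_k (1 - \hat{x}_k) \right) \right]_\xi
        - \frac{c}{r} \int \prod_{m=1}^{l} d\hat{y}_m \, \hat{\rho}(\hat{y}_m) \left[ \log\left( e^{F_n \zeta} \prod_m (1 + \hat{y}_m) + e^{-F_n \zeta} \prod_m (1 - \hat{y}_m) \right) \right]_\zeta.    (5.86)


Fig. 5.9. Magnetization as a function of the channel-error probability in the
LDPC (Murayama et al. 2000). Bold lines represent stable states. (a) r ≥ 3
or l ≥ 3, r > 1. (b) r = l = 2. (c) r = 1. (Copyright 2000 by the American
Physical Society)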


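The order functions π(x) and π̂(x̂) denote the distributions of the multi-replica
overlaps and their conjugates for the σ-spins, and ρ(y) and ρ̂(ŷ) are for the
τ-spins:


    q_{\alpha\beta\ldots\gamma} = a_q \int dx \, \pi(x) x^l, \qquad \hat{q}_{\alpha\beta\ldots\gamma} = a_{\hat{q}} \int d\hat{x} \, \hat{\pi}(\hat{x}) \hat{x}^l,
    r_{\alpha\beta\ldots\gamma} = a_r \int dy \, \rho(y) y^l, \qquad \hat{r}_{\alpha\beta\ldots\gamma} = a_{\hat{r}} \int d\hat{y} \, \hat{\rho}(\hat{y}) \hat{y}^l.    (5.87)
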
Extremization of the free energy (5.86) with respect to these order functions
yields ferromagnetic and paramagnetic solutions. Since the interactions in the
Hamiltonian (5.85) are of Mattis type without frustration, there is no spin glass
phase. When r ≥ 3 or l ≥ 3, r > 1, the free energy for an unbiased message
(F_s = 0) is


    f_F = -\frac{1}{R} F_n \tanh F_n, \qquad f_P = \frac{1}{R} \log 2 - \log 2 - \frac{1}{R} \log 2\cosh F_n.    (5.88)

The spin alignment is perfect (m = 1) in the ferromagnetic phase. The
magnetization as a function of the channel-error probability p is shown in Fig. 5.9(a).
The ferromagnetic state has a lower free energy below p_c that coincides with
the Shannon bound as can be verified by equating f_F and f_P. The paramagnetic
solution loses its significance below p_c because its entropy is negative in this
region. A serious drawback is that the basin of attraction of the ferromagnetic
state is quite small in the present case.
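
The verification mentioned above takes only a few lines. The following worked derivation is a sketch that uses nothing beyond (5.88) and the definition F_n = ½ log((1 − p)/p); natural logarithms are used until the last step.

```latex
% Identities following from F_n = (1/2) log((1-p)/p):
\tanh F_n = 1 - 2p, \qquad 2\cosh F_n = \frac{1}{\sqrt{p(1-p)}} .
% Insert into f_F = f_P of (5.88) and multiply both sides by R:
-\frac{1-2p}{2}\,\log\frac{1-p}{p}
   = \log 2 - R\log 2 + \frac{1}{2}\log\bigl(p(1-p)\bigr) .
% Collect the terms in log p and log(1-p):
-p\log p - (1-p)\log(1-p) = (1-R)\log 2 .
% Divide by log 2:
R = 1 + p\log_2 p + (1-p)\log_2(1-p) \qquad (p = p_c).
```

The last line is the Shannon bound in the same form as (5.76).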

If r = l = 2, the magnetization behaves as in Fig. 5.9(b). The perfect
ferromagnetic state and its reversal are the only solutions below a threshold p_s.
Any initial state converges to this perfect state under an appropriate decoding
algorithm. Thus the code with r = l = 2 is quite useful practically although the
threshold p_s lies below the Shannon bound.

The system with single-body interactions r = 1 has the magnetization as
shown in Fig. 5.9(c). Again, the Shannon bound is not saturated, but the perfect
ferromagnetic state is the only solution below p_s. An advantage of the present
case is that there is no mirror image (m = −1).

Iterative solutions using TAP-like equations work also in the LDPC as a
rapidly converging tool for decoding (Kabashima and Saad 2001; Saad et al.
2001). These equations have similar forms to (5.78) but with two types of
parameters, one for the σ-spins and the other for τ. Iterative numerical solutions
of these equations for given dilute matrices C_s and C_n show excellent agreement
with the replica predictions.


5.6.3 Cryptography 

The LDPC is also useful in public-key cryptography (Kabashima et al. 2000b). The
N-dimensional Boolean plaintext ξ is encrypted to an M-dimensional ciphertext
z by the public key G ≡ C_n^{-1} C_s D (where D is an arbitrary invertible dense
matrix of size N × N) and the noise ζ with probability p according to (5.81)


    z = G\xi + \zeta.    (5.89)


Only the authorized user has the knowledge of C_n, C_s, and D separately, not just
the product G. The authorized user then carries out the process of decryption
equivalent to the decoding of the LDPC to infer Dξ and consequently the original
plaintext ξ. This user succeeds if r = l = 2 and p < p_s as was discussed in the
previous subsection.

The task of decomposing G into C_n, C_s, and D is NP complete¹⁰ and is very
difficult for an unauthorized user, who is therefore forced to find the ground state
of the Hamiltonian, which is the Ising spin representation of (5.89):


    H = -\sum G_{i_1 \ldots i_{r'}} J_{i_1 \ldots i_{r'}} \sigma_{i_1} \cdots \sigma_{i_{r'}} - T F_s \sum_i \sigma_i,    (5.90)
i 


where G is a dense tensor with elements 1 or 0 corresponding to G, and J is
either 1 or −1 according to the noise added as ζ in the Boolean representation
(5.89). Thus the system is frustrated. For large N, the number r′ in the above
Hamiltonian and c′ (the connectivity of the system described by (5.90)) tend
to infinity (but are smaller than N itself) with the ratio c′/r′ kept finite. The
problem is thus equivalent to the Sourlas-type code in the same limit. We know,
as mentioned in §5.6.1, that the basin of attraction of the correctly decrypted
state in such a system is very narrow. Therefore the unauthorized user almost
surely fails to decrypt.

This system of cryptography has the advantage that it allows for relatively
high values of p, and thus an increased tolerance against noise in comparison
with existing systems. The computational requirement for decryption is of O(N),
which is much better than some of the commonly used methods.


10 See Chapter 9 for elucidation of the term ‘NP completeness’.


5.7 Convolutional code 


The convolutional code corresponds to a one-dimensional spin glass and plays
important roles in practical applications. It also has direct relevance to the turbo
code, to be elucidated in the next section, which is rapidly becoming the standard
in practice owing to its high capability of error correction. We explain
the convolutional code and its decoding from a statistical-mechanical point of
view following Montanari and Sourlas (2000).


5.7.1 Definition and examples 


In a convolutional code, one first transforms the original message sequence ξ =
{ξ_1, ..., ξ_N} (ξ_i = ±1, ∀i) into a register sequence τ = {τ_1(ξ), ..., τ_N(ξ)}
(τ_i = ±1, ∀i). In the non-recursive convolutional code, the register sequence
coincides with the message sequence (τ_i = ξ_i, ∀i), but this is not the case in
the recursive convolutional code to be explained later in §5.7.3. To encode the
message, one prepares r registers, the state of which at time t is described by
Σ_1(t), Σ_2(t), ..., Σ_r(t).¹¹ The number r is called the memory order of the code.


The register sequence τ is fed into the register sequentially (shift register):


    \Sigma_1(t+1) = \Sigma_0(t) \equiv \tau_t
    \Sigma_2(t+1) = \Sigma_1(t) = \tau_{t-1}
    \vdots    (5.91)
    \Sigma_r(t+1) = \Sigma_{r-1}(t) = \tau_{t-r+1}.


The encoder thus carries the information of (r + 1) bits τ_t, τ_{t−1}, ..., τ_{t−r} at any
moment t.

11 The original source message is assumed to be generated sequentially from i = 1 to i = N.
Consequently the time step denoted by t (= 1, 2, ..., N) is identified with the bit number
i (= 1, 2, ..., N).

Fig. 5.10. A convolutional code with code rate 1/2 (example 2 in the text)
expressed as a spin system. Interactions exist among three spins around each
triangle and between two horizontally neighbouring spins. Two up spins are
located at i = −1 and i = 0 to fix the initial condition.

We restrict ourselves to the convolutional code with rate R = 1/2 for
simplicity. Code words J = {J_1^{(1)}, ..., J_N^{(1)}; J_1^{(2)}, ..., J_N^{(2)}} are generated from the
register bits by the rule


    J_i^{(\alpha)} = \prod_{j=0}^{r} (\tau_{i-j})^{\kappa(j;\alpha)}.    (5.92)


Here, α = 1 or 2, and we define τ_j = 1 for j ≤ 0. The superscript κ(j;α) is
either 0 or 1 and characterizes a specific code. We define κ(0;1) = κ(0;2) = 1
to remove ambiguities in code construction. Two simple examples will be used
frequently to illustrate the idea:



1. κ(0;1) = κ(1;1) = 1, and the other κ(j;1) = 0; κ(0;2) = 1, and the other
κ(j;2) = 0. The memory order is r = 1. The code words are J_i^{(1)} = τ_i τ_{i−1}
and J_i^{(2)} = τ_i. The corresponding spin Hamiltonian is


    H = -\sum_{i=1}^{N} \tilde{J}_i^{(1)} \sigma_i \sigma_{i-1} - \sum_{i=1}^{N} \tilde{J}_i^{(2)} \sigma_i,    (5.93)


where J̃_i^{(α)} is the noisy version of J_i^{(α)} and σ_i is the dynamical variable
used for decoding. This is a one-dimensional spin system with random
interactions and random fields.

2. κ(0;1) = κ(1;1) = κ(2;1) = 1, and the other κ(j;1) = 0; κ(0;2) =
κ(2;2) = 1, and the other κ(j;2) = 0. The memory order is r = 2 and
the code words are J_i^{(1)} = τ_i τ_{i−1} τ_{i−2} and J_i^{(2)} = τ_i τ_{i−2}. There are
three-body and two-body interactions in the corresponding spin system


    H = -\sum_{i=1}^{N} \tilde{J}_i^{(1)} \sigma_i \sigma_{i-1} \sigma_{i-2} - \sum_{i=1}^{N} \tilde{J}_i^{(2)} \sigma_i \sigma_{i-2},    (5.94)


which can be regarded as a system of ladder-like structures shown in Fig.
5.10. A diagrammatic representation of the encoder is depicted in Fig. 5.11.
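
The rule (5.92) and the two examples can be written out directly. The sketch below is illustrative: the function name, the dictionary layout for κ(j;α), and the test sequence are assumptions. The register sequence is passed in as given, so for the non-recursive code it is simply the message itself.

```python
def conv_encode(tau, kappa):
    """Code words J_i^(alpha) = prod_{j=0}^{r} tau_{i-j}^{kappa(j;alpha)}, eq. (5.92).

    tau   : register sequence tau_1..tau_N as a list of +1/-1
    kappa : dict alpha -> [kappa(0;alpha), ..., kappa(r;alpha)] with entries 0/1
    """
    N = len(tau)

    def t(j):                       # 1-based access with tau_j = 1 for j <= 0
        return tau[j - 1] if j >= 1 else 1

    code = {}
    for alpha, taps in kappa.items():
        words = []
        for i in range(1, N + 1):
            prod = 1
            for j, k in enumerate(taps):
                if k:
                    prod *= t(i - j)
            words.append(prod)
        code[alpha] = words
    return code

EXAMPLE_1 = {1: [1, 1], 2: [1, 0]}          # memory order r = 1
EXAMPLE_2 = {1: [1, 1, 1], 2: [1, 0, 1]}    # memory order r = 2

tau = [-1, 1, -1, -1]                       # arbitrary test sequence
code1 = conv_encode(tau, EXAMPLE_1)
code2 = conv_encode(tau, EXAMPLE_2)
print(code1, code2)
```

Every message bit produces two code-word bits, one per value of α, which is the rate R = 1/2.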


5.7.2 Generating polynomials 

Exposition of the encoding procedure in terms of the Boolean (0 or 1)
representation instead of the binary (±1) representation is useful to introduce the
recursive convolutional code in the next subsection. For this purpose, we express
the original message sequence ξ = {ξ_1, ..., ξ_N} by its generating polynomial
defined as


    H(x) = \sum_{j=1}^{N} H_j x^j,    (5.95)


where H_j (= 0 or 1) is the Boolean form of ξ_j: ξ_j = (−1)^{H_j}. Similarly the
generating polynomial for the register sequence τ is


    G(x) = \sum_{j=1}^{N} G_j x^j.    (5.96)


Fig. 5.11. Encoder corresponding to the code of Fig. 5.10. J_t^{(1)} is formed from
the three consecutive register bits and J_t^{(2)} from two bits.


The non-recursive convolutional code has G(x) = H(x), but this is not the case
for the recursive code to be explained in the next subsection. The code word
J^{(α)} (α = 1, 2) is written as


    L^{(\alpha)}(x) = \sum_{j=1}^{N} L_j^{(\alpha)} x^j    (5.97)


with J_j^{(α)} = (−1)^{L_j^{(α)}}. The relation between L^{(α)}(x) and G(x) is determined by
(5.92) and is described by another polynomial


    g_\alpha(x) = \sum_{j=0}^{r} \kappa(j;\alpha) x^j    (5.98)

as

    L^{(\alpha)}(x) = g_\alpha(x) G(x)    (5.99)

or equivalently

    L_i^{(\alpha)} = \sum_{j=0}^{r} \kappa(j;\alpha) G_{i-j} \pmod 2.    (5.100)


The right hand side is the convolution of κ and G, from which the name of
convolutional code comes.

The examples 1 and 2 of §5.7.1 have the generating polynomials as (1) g_1(x) =
1 + x and g_2(x) = 1, and (2) g_1(x) = 1 + x + x² and g_2(x) = 1 + x².
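
The Boolean form (5.99) can be verified by multiplying polynomials modulo 2. The sketch below is illustrative; polynomials are stored as coefficient lists indexed by the power of x, and the helper names and the test sequence are assumptions. It uses the generating polynomials of example 2.

```python
def poly_mul_gf2(a, b):
    """Product of two polynomials with coefficients modulo 2."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        if ai:
            for j, bj in enumerate(b):
                out[i + j] ^= bj
    return out

def codeword_bits(g, G_bits):
    """Coefficients L_1..L_N of L(x) = g(x) G(x), eqs (5.99) and (5.100)."""
    N = len(G_bits)
    G_poly = [0] + list(G_bits)            # G(x) starts at x^1
    return poly_mul_gf2(g, G_poly)[1:N + 1]

g1 = [1, 1, 1]                             # 1 + x + x^2
g2 = [1, 0, 1]                             # 1 + x^2
G_bits = [1, 0, 1, 1]                      # Boolean form of tau = (-1, 1, -1, -1)

L1 = codeword_bits(g1, G_bits)
L2 = codeword_bits(g2, G_bits)
J1 = [(-1) ** b for b in L1]               # back to the +1/-1 representation
J2 = [(-1) ** b for b in L2]
print(L1, L2, J1, J2)
```

The resulting J1 and J2 coincide with the products τ_i τ_{i−1} τ_{i−2} and τ_i τ_{i−2} of example 2 for the same register sequence, since multiplication of ±1 variables corresponds to addition modulo 2 of their Boolean forms.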


5.7.3 Recursive convolutional code 


The relation between the source and register sequences ξ and τ (or H(x) and
G(x)) is not simple in the recursive convolutional code. The register sequence of
recursive convolutional code is defined by the generating polynomial as


    G(x) = \frac{1}{g_1(x)} H(x).    (5.101)


Fig. 5.12. Encoder of the recursive convolutional code to be compared with the
non-recursive case of Fig. 5.11.


Code words satisfy L^{(α)}(x) = g_α(x)G(x) and therefore we have


    L^{(1)}(x) = H(x), \qquad L^{(2)}(x) = \frac{g_2(x)}{g_1(x)} H(x).    (5.102)


The first relation means J^{(1)} = ξ in the binary representation.

The relation between the source and register sequences (5.101) can be written
in terms of the binary representation as follows. Equation (5.101) is seen to
be equivalent to G(x) = H(x) + (g_1(x) − 1)G(x) because G(x) = −G(x) and
H(x) = −H(x) (mod 2). The coefficient of x^i in this relation is, if we recall
κ(0;1) = 1, G_i = H_i + Σ_{j=1}^{r} κ(j;1)G_{i−j}, which has the binary representation


    \tau_i = \xi_i \prod_{j=1}^{r} (\tau_{i-j})^{\kappa(j;1)} \iff \xi_i = \prod_{j=0}^{r} (\tau_{i-j})^{\kappa(j;1)}.    (5.103)


This equation allows us to determine τ_i recursively; that is, τ_i is determined if
we know τ_1, ..., τ_{i−1}. From the definition L^{(α)}(x) = g_α(x)G(x), code words are
expressed in terms of the register sequence in the same form as in the case of
the non-recursive convolutional code:

    J_i^{(\alpha)} = \prod_{j=0}^{r} (\tau_{i-j})^{\kappa(j;\alpha)}.    (5.104)


The encoder for the code of example 2 of §5.7.1 is shown in Fig. 5.12. Decoding
is carried out under the Hamiltonian


    H = -\sum_{i=1}^{N} \left\{ \tilde{J}_i^{(1)} \prod_{j=0}^{r} (\sigma_{i-j})^{\kappa(j;1)} + \tilde{J}_i^{(2)} \prod_{j=0}^{r} (\sigma_{i-j})^{\kappa(j;2)} \right\},    (5.105)

where J̃_i^{(α)} is the noisy version of the code word J_i^{(α)}. According to (5.103), the
ith bit is inferred at the inverse temperature β as


    \hat{\xi}_i = \mathrm{sgn} \left\langle \prod_{j=0}^{r} (\sigma_{i-j})^{\kappa(j;1)} \right\rangle_\beta,    (5.106)

which is to be contrasted with the non-recursive case

    \hat{\xi}_i = \mathrm{sgn} \langle \sigma_i \rangle_\beta.    (5.107)
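
The recursion for the register sequence and its inversion can be illustrated in a few lines. The sketch below is an assumption-labelled example (function names and the test message are mine): it builds τ from ξ with the rule τ_i = ξ_i ∏_{j≥1} τ_{i−j}^{κ(j;1)} and recovers ξ from τ with the product over j from 0 to r, using the taps κ(j;1) of example 2.

```python
def recursive_register(xi, taps1):
    """tau_i = xi_i * prod_{j=1}^{r} tau_{i-j}^{kappa(j;1)}, with tau_j = 1 for j <= 0."""
    tau = []
    for i, x in enumerate(xi):              # list index i holds bit number i + 1
        prod = x
        for j in range(1, len(taps1)):
            if taps1[j] and i - j >= 0:
                prod *= tau[i - j]
        tau.append(prod)
    return tau

def recover_message(tau, taps1):
    """xi_i = prod_{j=0}^{r} tau_{i-j}^{kappa(j;1)}, the inverse relation."""
    xi = []
    for i in range(len(tau)):
        prod = 1
        for j, k in enumerate(taps1):
            if k and i - j >= 0:
                prod *= tau[i - j]
        xi.append(prod)
    return xi

taps1 = [1, 1, 1]                           # kappa(0;1) = kappa(1;1) = kappa(2;1) = 1
xi = [-1, 1, -1, -1, 1]                     # arbitrary test message
tau = recursive_register(xi, taps1)
print(tau, recover_message(tau, taps1))
```

The recovered sequence equals the first code word J^{(1)}, in line with the first relation of (5.102), and it is this product of spins, not a single spin, whose thermal average is taken in (5.106).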


5.8 Turbo code 


The turbo code is a powerful coding/decoding technique that has come into frequent use.
It has near-optimal performance (i.e. the transmission rate can be made close
to the Shannon bound under the error-free condition), which is exceptional in a
practicable code. We explain its statistical-mechanical formulation and some of
the results (Montanari and Sourlas 2000; Montanari 2000).

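The turbo code is a variant of the recursive convolutional code with the
source message sequence ξ = {ξ_1, ..., ξ_N} and the permuted sequence ξ^P =
{ξ_{P(1)}, ..., ξ_{P(N)}} as the input to the encoder. The permutation P operates on
the set {1, 2, ..., N} and is fixed arbitrarily for the moment. Correspondingly,
two register sequences are generated according to the prescription of the recursive
convolutional code (5.103):


    \tau_i^{(1)} = \tau_i(\xi) = \xi_i \prod_{j=1}^{r} (\tau_{i-j}^{(1)})^{\kappa(j;1)}    (5.108)
    \tau_i^{(2)} = \tau_i(\xi^P) = \xi_{P(i)} \prod_{j=1}^{r} (\tau_{i-j}^{(2)})^{\kappa(j;1)}    (5.109)

or, equivalently,

    \xi_i = \prod_{j=0}^{r} (\tau_{i-j}^{(1)})^{\kappa(j;1)} \equiv \epsilon_i(\tau^{(1)})    (5.110)
    \xi_i^P = \prod_{j=0}^{r} (\tau_{i-j}^{(2)})^{\kappa(j;1)} \equiv \epsilon_i(\tau^{(2)}).    (5.111)


The code words are comprised of three sequences and the rate is R = 1/3:


    J_i^{(0)} = \prod_{j=0}^{r} (\tau_{i-j}^{(1)})^{\kappa(j;1)}, \quad J_i^{(1)} = \prod_{j=0}^{r} (\tau_{i-j}^{(1)})^{\kappa(j;2)}, \quad J_i^{(2)} = \prod_{j=0}^{r} (\tau_{i-j}^{(2)})^{\kappa(j;2)}.    (5.112)


The posterior to be used at the receiving end of the channel has the following
expression:


    P(\sigma^{(1)}, \sigma^{(2)} | \tilde{J}^{(0)}, \tilde{J}^{(1)}, \tilde{J}^{(2)}) = \frac{1}{Z} \prod_{i=1}^{N} \delta(\epsilon_{P(i)}(\sigma^{(1)}), \epsilon_i(\sigma^{(2)})) \cdot \exp\{-\beta H(\sigma^{(1)}, \sigma^{(2)})\},    (5.113)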


where the Hamiltonian is, corresponding to (5.112),


    H(\sigma^{(1)}, \sigma^{(2)}) = -\sum_{i=1}^{N} \left\{ \tilde{J}_i^{(0)} \prod_{j=0}^{r} (\sigma_{i-j}^{(1)})^{\kappa(j;1)} + \tilde{J}_i^{(1)} \prod_{j=0}^{r} (\sigma_{i-j}^{(1)})^{\kappa(j;2)} + \tilde{J}_i^{(2)} \prod_{j=0}^{r} (\sigma_{i-j}^{(2)})^{\kappa(j;2)} \right\}.    (5.114)

The interactions J̃_i^{(0)}, J̃_i^{(1)}, and J̃_i^{(2)} are the noisy versions of the code words
J_i^{(0)}, J_i^{(1)}, and J_i^{(2)}, respectively. The system (5.114) is a one-dimensional spin
glass composed of two chains (σ^{(1)} and σ^{(2)}) interacting via the constraint
ε_{P(i)}(σ^{(1)}) = ε_i(σ^{(2)}) (∀i). In decoding, one calculates the thermal expectation
value of the variable representing the original bit, (5.110), using the posterior
(5.113):

    \hat{\xi}_i = \mathrm{sgn} \langle \epsilon_i(\sigma^{(1)}) \rangle_\beta.    (5.115)

The finite-temperature (MPM) decoding with the appropriate β is used in practice
because an efficient TAP-like finite-temperature iterative algorithm exists as
explained later briefly.

To understand the effectiveness of turbo code intuitively, it is instructive
to express the spin variable σ_i^{(1)} in terms of the other set σ^{(2)}. In example
1 of §5.7.1, we have κ(0;1) = κ(1;1) = 1 and therefore, from the constraint
ε_{P(i)}(σ^{(1)}) = ε_i(σ^{(2)}), σ_{P(i)}^{(1)} σ_{P(i)−1}^{(1)} = σ_i^{(2)} σ_{i−1}^{(2)}, see (5.110) and (5.111). We thus
have

    \sigma_i^{(1)} = \prod_{j=1}^{i} \epsilon_j(\sigma^{(1)}) = \prod_{j=1}^{i} \sigma_{P^{-1}(j)}^{(2)} \sigma_{P^{-1}(j)-1}^{(2)},

with σ_j^{(α)} = 1 for j ≤ 0.
If i is of O(N) and the permutation P is random, it is very plausible that this
final product of the σ^{(2)} is composed of O(N) different σ^{(2)}. This means that
the Hamiltonian (5.114) has long-range interactions if expressed only in terms
of σ^{(2)}, and the ferromagnetic phase (in the ferromagnetic gauge) is likely to
have an enhanced stability compared to simple one-dimensional systems. We
may therefore expect that good performance is achieved in a turbo code with
random permutation P, which is indeed confirmed to be the case in numerical
experiments.

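The decoding algorithm of the turbo code is described as follows. One prepares
two chains labelled by α = 1, 2 with the Hamiltonian


    H^{(\alpha)}(\sigma^{(\alpha)}) = -\sum_i \left\{ (\tilde{J}_i^{(0)} + \Gamma_i^{(\alpha)}) \prod_{j=0}^{r} (\sigma_{i-j}^{(\alpha)})^{\kappa(j;1)} + \tilde{J}_i^{(\alpha)} \prod_{j=0}^{r} (\sigma_{i-j}^{(\alpha)})^{\kappa(j;2)} \right\}.    (5.116)

Then one iteratively solves a set of TAP-like equations for the effective fields
Γ_i^{(α)} that represent the effects of the other chain:


    \Gamma_i^{(1)}(k+1) = \beta^{-1} \tanh^{-1} \langle \epsilon_{P^{-1}(i)}(\sigma^{(2)}) \rangle^{(2)} - \Gamma_{P^{-1}(i)}^{(2)}(k)    (5.117)
    \Gamma_i^{(2)}(k+1) = \beta^{-1} \tanh^{-1} \langle \epsilon_{P(i)}(\sigma^{(1)}) \rangle^{(1)} - \Gamma_{P(i)}^{(1)}(k),    (5.118)

where ⟨···⟩^{(α)} is the thermal average with the Hamiltonian H^{(α)}(σ^{(α)}) and k
denotes the iteration step. This procedure is an approximation to
the full system (5.114) and yet yields excellent performance numerically.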

Detailed statistical-mechanical analysis of the system H^{(1)}(σ^{(1)}) + H^{(2)}(σ^{(2)})
with ε_{P(i)}(σ^{(1)}) = ε_i(σ^{(2)}) has been carried out (Montanari 2000). We describe
some of its important results. Let us suppose that the channel is Gaussian and β
is adjusted to the optimal value (MPM). The S/N ratio is denoted as 1/w². There
exists a phase of error-free decoding (overlap M = 1) that is locally unstable in
the high-noise region w² > w_c². The numerical values w_c² are 1/log 4 = 0.721
for the code 1 of §5.7.1 and 1.675 for the code 2. The latter is very close to the
Shannon limit w_c² = 1/(2^{2/3} − 1) = 1.702 derived by equating the capacity of the
Gaussian channel with the rate R = 1/3:


    \frac{1}{2} \log_2 \left( 1 + \frac{1}{w_c^2} \right) = \frac{1}{3}.    (5.119)


The limit of the first example (w_c² = 0.721) is found to be close to numerical
results whereas the second (w_c² = 1.675) shows some deviation from numerical
results. The stability analysis leading to these values may not give the correct
answer if a first-order phase transition takes place in the second example.
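
The numbers quoted above follow from elementary arithmetic, which the short sketch below reproduces. The function name is an assumption; the formula is just (5.119) solved for the noise variance at a general rate R.

```python
import math

def shannon_limit_w2(R: float) -> float:
    """Noise variance w^2 at which (1/2) log2(1 + 1/w^2) equals the rate R."""
    return 1.0 / (2.0 ** (2.0 * R) - 1.0)

w2_shannon = shannon_limit_w2(1.0 / 3.0)    # rate of the turbo code
w2_code1 = 1.0 / math.log(4.0)              # stability limit quoted for code 1
print(f"Shannon limit: {w2_shannon:.3f}, code 1: {w2_code1:.3f}")
```

The stability limit 1.675 of code 2 lies just below the Shannon value, while that of code 1 is far from it.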


5.9 CDMA multiuser demodulator 


In this section we present a statistical-mechanical analysis of signal transmission 
by modulation (T. Tanaka 2001). This topic deviates somewhat from the other 
parts of this chapter. The signal is not encoded and decoded but is modulated 
and demodulated as described below. Nevertheless, the goal is very similar to 
error-correcting codes: to extract the best possible information from a noisy 
output using the idea of Bayesian inference. 


5.9.1 Basic idea of CDMA 


Code-division multiple access (CDMA) is an important standard of modern mo- 
bile communications (Simon et al. 1994; Viterbi 1995). The digital signal of a 
user is modulated and transmitted to a base station through a channel that is 
shared by multiple users. At the base station, the original digital signal is re- 
trieved by demodulation of the received signal composed of the superposition of 
multiple original signals and noise. An important problem is therefore to design 
an efficient method to modulate and demodulate signals. 

In CDMA, one modulates a signal in the following way. Let us focus our
attention on a signal interval, which is the time interval carrying a single digital
signal, with a signal ξ_i (= ±1) for the ith user. The signal interval is divided into
p chip intervals (p = 4 in Fig. 5.13). User i is assigned a spreading code sequence
η_i^t (= ±1) (t = 1, ..., p). The signal ξ_i is modulated in each chip interval t by the
spreading code sequence according to the multiplication η_i^t ξ_i. Modulated signals
of N users are superimposed in a channel and are further disturbed by noise. At
the base station, one receives the signal


    y^t = \sum_{i=1}^{N} \eta_i^t \xi_i + \nu^t    (5.120)


at the chip interval t and is asked to retrieve the original signals ξ_i (i = 1, ..., N)
from y^t (t = 1, ..., p) with the knowledge of the spreading code sequence η_i^t
(t = 1, ..., p; i = 1, ..., N).

Fig. 5.13. Modulation of the signal of a single user in CDMA. A signal interval
is composed of four chip intervals in this example. The full line represents
the original signal and the dashed line denotes the spreading code sequence.

Before proceeding to the problem of demodulation, we list a few points of
idealization that lead to the simple formula (5.120): modulated signals of N
users are assumed to be transmitted under perfect synchronization at each chip
interval t throughout an information signal interval. This allows us simply to
sum up η_i^t ξ_i over all i (= 1, ..., N) at any given chip interval t. Furthermore, all
signals are supposed to have the same amplitude (normalized to unity in (5.120)),
a perfect power control. Other complications (such as the effects of reflections)
are ignored in the present formulation. These aspects would have to be taken
into account when one applies the theory to realistic situations.

The measure of performance is the overlap of the original (ξ_i) and demodulated
(ξ̂_i) signals


    M = \frac{1}{N} \sum_{i=1}^{N} \xi_i \hat{\xi}_i,    (5.121)


averaged over the distributions of ξ_i, η_i^t, and ν^t. Equivalently, one may try to
minimize the bit-error rate (the average probability of error per bit) (1 − M)/2.
We show in the following that the CDMA multiuser demodulator, which uses
Bayesian inference, gives a larger overlap than the conventional demodulator.
5.9.2 Conventional and Bayesian demodulators


Let us first explain the simple method of the conventional demodulator. To
extract the information of $\xi_i$ from $y^t$, we multiply the received signal at the tth
chip interval $y^t$ by the spreading code $\eta_i^t$ and sum it up over the whole signal
interval:

$$h_i \equiv \frac{1}{N}\sum_{t=1}^{p} \eta_i^t y^t = \alpha \xi_i + \frac{1}{N}\sum_{t=1}^{p}\sum_{k(\neq i)} \eta_i^t \eta_k^t \xi_k + \frac{1}{N}\sum_{t=1}^{p} \eta_i^t \nu^t. \tag{5.122}$$

The first term on the right hand side is the original signal, the second represents
multiuser interference, and the third is the channel noise (which is assumed to
be Gaussian). We then demodulate the signal by taking the sign of this quantity

$$\hat{\xi}_i = \mathrm{sgn}(h_i). \tag{5.123}$$


It is easy to analyse the performance of this conventional demodulator in the
limit of large N and p with $\alpha = p/N$ fixed. We also assume that the noise power
$\sigma_s^2$, the variance of $\nu^t$, scales with N such that $\beta_s \equiv N/\sigma_s^2$ is of O(1), and that $\eta_i^t$
and $\xi_k$ are all independent. Then the second and third terms on the right hand
side of (5.122) are Gaussian variables, resulting in the overlap

$$M = \mathrm{Erf}\left(\sqrt{\frac{\alpha}{2(1 + \beta_s^{-1})}}\right), \tag{5.124}$$

where Erf(x) is the error function $\frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt$. This represents the performance of
the conventional demodulator as a function of the number of chip intervals per
signal interval $\alpha$ and the noise power $\beta_s$.
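The formula (5.124) can be compared with a direct simulation of (5.120), (5.122), and (5.123). The following is a minimal sketch under stated assumptions: binary spreading codes $\eta_i^t = \pm 1$ and signals $\xi_i = \pm 1$ drawn uniformly at random, and hypothetical function names chosen only for this illustration.

```python
import math
import random


def conventional_overlap_theory(alpha, beta_s):
    """Large-N overlap of the conventional demodulator, eq. (5.124)."""
    return math.erf(math.sqrt(alpha / (2.0 * (1.0 + 1.0 / beta_s))))


def simulate_conventional(n_users, n_chips, beta_s, rng):
    """Overlap (5.121) for one random instance of the channel (5.120)."""
    sigma = math.sqrt(n_users / beta_s)  # beta_s = N / sigma_s^2
    xi = [rng.choice((-1, 1)) for _ in range(n_users)]
    h = [0.0] * n_users
    for _ in range(n_chips):
        eta = [rng.choice((-1, 1)) for _ in range(n_users)]
        # received signal at this chip interval, eq. (5.120)
        y = sum(e * x for e, x in zip(eta, xi)) + rng.gauss(0.0, sigma)
        for i in range(n_users):
            h[i] += eta[i] * y / n_users  # accumulates h_i of eq. (5.122)
    xi_hat = [1 if v >= 0 else -1 for v in h]  # eq. (5.123)
    return sum(a * b for a, b in zip(xi, xi_hat)) / n_users


if __name__ == "__main__":
    rng = random.Random(0)
    alpha, beta_s, n_users = 2.0, 1.0, 200
    runs = [simulate_conventional(n_users, int(alpha * n_users), beta_s, rng)
            for _ in range(20)]
    print("simulation:", sum(runs) / len(runs))
    print("theory    :", conventional_overlap_theory(alpha, beta_s))
```

For finite N the simulated overlap fluctuates around the theoretical value; agreement is expected only in the limit of large N and p with $\alpha$ fixed.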

To improve the performance, it is useful to construct the posterior of the
original signal, given the noisy signal, following the method of Bayesian
inference. Let us denote the set of original signals by $\xi = {}^t(\xi_1,\dots,\xi_N)$ and the
corresponding dynamical variables for demodulation by $S = {}^t(S_1,\dots,S_N)$. The
sequence of received signals within p chip intervals is also written as a vector in
a p-dimensional space $y = {}^t(y^1,\dots,y^p)$. Once the posterior P(S|y) is given, one
demodulates the signal by the MAP or MPM:

$$\text{MAP:}\quad \hat{\xi} = \arg\max_{S} P(S|y), \tag{5.125}$$
$$\text{MPM:}\quad \hat{\xi}_i = \arg\max_{S_i} \mathrm{Tr}_{S\setminus S_i} P(S|y). \tag{5.126}$$

To construct the posterior, we first write the distribution of Gaussian noise
$\nu^t = y^t - \sum_i \eta_i^t \xi_i$, (5.120), as

$$\prod_{t=1}^{p} P(\nu^t) \propto \exp\left\{-\frac{\beta_s}{2N}\sum_{t=1}^{p}\Big(y^t - \sum_{i=1}^{N}\eta_i^t\xi_i\Big)^2\right\} \propto \exp\{-\beta_s H(\xi)\}, \tag{5.127}$$

where the effective Hamiltonian has been defined as

$$H(\xi) = \frac{1}{2}\sum_{i,j=1}^{N} J_{ij}\xi_i\xi_j - \sum_{i=1}^{N} h_i\xi_i, \qquad J_{ij} = \frac{1}{N}\sum_{t=1}^{p}\eta_i^t\eta_j^t. \tag{5.128}$$


The field $h_i$ has already been defined in (5.122). If we assume that the prior is
uniform, $P(\xi)$ = const, the posterior is seen to be directly proportional to the
likelihood (5.127) according to the Bayes formula:

$$P(S|y) \propto \exp\{-\beta_s H(S)\}. \tag{5.129}$$


The Hamiltonian (5.128) looks very similar to the Hopfield model to be discussed 
in Chapter 7, (7.7) with (7.4), the only difference being that the sign of the 
interaction is the opposite (Miyajima et al. 1993). 


5.9.3 Replica analysis of the Bayesian demodulator 


The replica method is useful to analyse the performance of the Bayesian demodu- 
lator represented by the posterior (5.129). 

Since we usually do not know the noise power of the channel $\beta_s$, it is
appropriate to write the normalized posterior with an arbitrary noise parameter $\beta$ in
place of $\beta_s$, the latter being the true value. From (5.127)-(5.129), we then find

$$P(S|r) = \frac{2^{-N}}{Z(r)}\exp\left\{-\frac{\beta}{2}\sum_{t=1}^{p}\Big(r^t - N^{-1/2}\sum_{i=1}^{N}\eta_i^t S_i\Big)^2\right\}, \tag{5.130}$$

where the vector r denotes ${}^t(r^1,\dots,r^p)$ with $r^t = y^t/\sqrt{N}$, and the normalization
factor (or the partition function) is given as

$$Z(r) = 2^{-N}\,\mathrm{Tr}_S \exp\left\{-\frac{\beta}{2}\sum_{t=1}^{p}\Big(r^t - N^{-1/2}\sum_{i=1}^{N}\eta_i^t S_i\Big)^2\right\}. \tag{5.131}$$

The factor $2^{-N}$ is the uniform prior for $\xi$. The macroscopic behaviour of the
system is determined by the free energy averaged over the distributions of the
spreading code sequence, which is assumed to be completely random, and the
channel noise. The latter distribution of noise is nothing more than the partition
function (5.131) with the true hyperparameter $\beta = \beta_s$, which we denote by $Z_0(r)$.
The replica average is therefore expressed as

$$[Z^n] = \int \prod_{t=1}^{p} dr^t\, [Z_0(r) Z^n(r)]_{\eta}, \tag{5.132}$$


where the configurational average on the right hand side is over the spreading
code sequence. It is convenient to separate the above quantity into the
spin-dependent part $g_1$ and the rest $g_2$ for further calculations:

$$[Z^n] = \int \prod_{0\le\alpha<\beta\le n} dQ_{\alpha\beta}\; e^{N(g_1 + \alpha g_2)}, \tag{5.133}$$

where $\alpha = p/N$ in the exponent should not be confused with the replica index.
The zeroth replica ($\alpha = 0$) corresponds to the probability weight $Z_0$. The two
functions $g_1$ and $g_2$ are defined by

$$e^{N g_1} = \mathrm{Tr}_S \prod_{0\le\alpha<\beta\le n} \delta(S_\alpha \cdot S_\beta - N Q_{\alpha\beta}) \tag{5.134}$$
$$e^{g_2} = \int dr \left[\exp\left\{-\frac{\beta_s}{2}(r - v_0)^2 - \frac{\beta}{2}\sum_{\alpha=1}^{n}(r - v_\alpha)^2\right\}\right]_{\eta}, \tag{5.135}$$

where the following notations have been used:

$$v_0 = \frac{1}{\sqrt{N}}\sum_{i=1}^{N}\eta_i\xi_i, \qquad v_\alpha = \frac{1}{\sqrt{N}}\sum_{i=1}^{N}\eta_i S_i^{\alpha} \quad (\alpha = 1,\dots,n). \tag{5.136}$$


In the thermodynamic limit $p, N \to \infty$ with their ratio $\alpha$ fixed, these $v_0$ and
$v_\alpha$ become Gaussian variables with vanishing mean and covariance given by the
overlap of spin variables, under the assumption of a random distribution of the
spreading code sequence:

$$Q_{\alpha\beta} = [v_\alpha v_\beta]_{\eta} = \frac{S_\alpha \cdot S_\beta}{N} \quad (\alpha, \beta = 0,\dots,n). \tag{5.137}$$

To proceed further, we assume symmetry between replicas ($\alpha = 1,\dots,n$):
$Q_{0\alpha} = m$, $Q_{\alpha\beta} = q$ ($\alpha, \beta \ge 1$). Then $v_0$ and $v_\alpha$ are more conveniently written in
terms of independent Gaussian variables u, t, and $z_\alpha$ with vanishing mean and
unit variance,

$$v_0 = u\sqrt{1 - \frac{m^2}{q}} - \frac{tm}{\sqrt{q}}, \qquad v_\alpha = z_\alpha\sqrt{1 - q} - t\sqrt{q} \quad (\alpha \ge 1). \tag{5.138}$$


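We are now ready to evaluate the factor $e^{g_2}$ explicitly as

$$\begin{aligned}
e^{g_2} &= \int dr \int Dt \int Du\, \exp\left\{-\frac{\beta_s}{2}\Big(u\sqrt{1 - \frac{m^2}{q}} - \frac{tm}{\sqrt{q}} - r\Big)^2\right\} \left\{\int Dz\, \exp\left(-\frac{\beta}{2}\big(z\sqrt{1-q} - t\sqrt{q} - r\big)^2\right)\right\}^n \\
&= \{1 + \beta_s(1 - m^2/q)\}^{-1/2}\{1 + \beta(1-q)\}^{-n/2} \int dr \int Dt\, \exp\left\{-\frac{\beta_s (tm/\sqrt{q} + r)^2}{2\{1 + \beta_s(1 - m^2/q)\}} - \frac{n\beta(t\sqrt{q} + r)^2}{2\{1 + \beta(1-q)\}}\right\} \\
&= \sqrt{2\pi}\,\{1 + \beta(1-q)\}^{-(n-1)/2}\left[\beta_s\{1 + \beta(1-q)\} + n\beta\{1 + \beta_s(1 - 2m + q)\}\right]^{-1/2}.
\end{aligned} \tag{5.139}$$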


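The other factor $e^{N g_1}$ (5.134) can be evaluated using the Fourier representation
of the delta function

$$e^{N g_1} = \int \prod_{0\le\alpha<\beta\le n} \frac{dM_{\alpha\beta}}{2\pi i}\; \exp N\left\{\log G(M) - \sum_{0\le\alpha<\beta\le n} M_{\alpha\beta}Q_{\alpha\beta}\right\} \tag{5.140}$$

$$G(M) = \mathrm{Tr}_S \exp\Big(\sum_{0\le\alpha<\beta\le n} M_{\alpha\beta}S_\alpha S_\beta\Big) = 2\int Dz\, \big(2\cosh(\sqrt{F}z + E)\big)^n e^{-nF/2}, \tag{5.141}$$

where we have used the RS form of the matrix $M_{0\alpha} = E$ and $M_{\alpha\beta} = F$ ($\alpha \ne \beta \ge 1$).
In the thermodynamic limit, the leading contribution is

$$g_1 = \log \int Dz\, \big(2\cosh(\sqrt{F}z + E)\big)^n - \frac{n}{2}F - nEm - \frac{1}{2}n(n-1)Fq. \tag{5.142}$$

From (5.139) and (5.142), the total free energy $g_1 + \alpha g_2$ is given in the limit
$n \to 0$ as

$$-\beta f = \int Dz\, \log 2\cosh(\sqrt{F}z + E) - Em - \frac{1}{2}F(1 - q) - \frac{\alpha}{2}\log\{1 + \beta(1-q)\} - \frac{\alpha\beta\{1 + \beta_s(1 - 2m + q)\}}{2\beta_s\{1 + \beta(1-q)\}}. \tag{5.143}$$

Extremization of the free energy yields the equations of state for the order
parameters as

$$m = \int Dz\, \tanh(\sqrt{F}z + E), \qquad q = \int Dz\, \tanh^2(\sqrt{F}z + E) \tag{5.144}$$

$$E = \frac{\alpha\beta}{1 + \beta(1-q)}, \qquad F = \frac{\alpha\beta^2(\beta_s^{-1} + 1 - 2m + q)}{\{1 + \beta(1-q)\}^2}. \tag{5.145}$$

The overlap is determined from these quantities by

$$M = \int Dz\, \mathrm{sgn}(\sqrt{F}z + E). \tag{5.146}$$

The stability limit of the RS solution, the AT line, is expressed as

$$\alpha = E^2 \int Dz\, \mathrm{sech}^4(\sqrt{F}z + E). \tag{5.147}$$

The optimum demodulation by MPM is achieved at the parameter $\beta = \beta_s$
whereas the MAP corresponds to $\beta \to \infty$.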
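The equations of state (5.144) and (5.145) are readily solved by fixed-point iteration. The sketch below is one possible numerical scheme, not the procedure used to produce the published figures: the Gaussian averages are approximated by a trapezoidal rule on a finite interval, the iteration is damped, and the overlap (5.146) is evaluated through the identity $\int Dz\,\mathrm{sgn}(\sqrt{F}z + E) = \mathrm{Erf}(E/\sqrt{2F})$. All function names are hypothetical.

```python
import math


def gauss_average(f, z_max=8.0, n=801):
    """Trapezoidal estimate of the Gaussian average of f(z) with measure Dz."""
    dz = 2.0 * z_max / (n - 1)
    total = 0.0
    for k in range(n):
        z = -z_max + k * dz
        weight = 0.5 if k in (0, n - 1) else 1.0
        total += weight * math.exp(-0.5 * z * z) * f(z)
    return total * dz / math.sqrt(2.0 * math.pi)


def conjugate_fields(m, q, alpha, beta, beta_s):
    """E and F of eq. (5.145)."""
    d = 1.0 + beta * (1.0 - q)
    E = alpha * beta / d
    F = alpha * beta ** 2 * (1.0 / beta_s + 1.0 - 2.0 * m + q) / d ** 2
    return E, F


def solve_rs(alpha, beta, beta_s, iters=300, damping=0.5, tol=1e-12):
    """Damped iteration of (5.144)-(5.145); returns (m, q, M)."""
    m, q = 0.0, 0.0
    for _ in range(iters):
        E, F = conjugate_fields(m, q, alpha, beta, beta_s)
        root_f = math.sqrt(F)
        m_new = gauss_average(lambda z: math.tanh(root_f * z + E))
        q_new = gauss_average(lambda z: math.tanh(root_f * z + E) ** 2)
        step = max(abs(m_new - m), abs(q_new - q))
        m += damping * (m_new - m)
        q += damping * (q_new - q)
        if step < tol:
            break
    E, F = conjugate_fields(m, q, alpha, beta, beta_s)
    overlap = math.erf(E / math.sqrt(2.0 * F))  # closed form of eq. (5.146)
    return m, q, overlap


if __name__ == "__main__":
    for alpha in (0.5, 1.0, 2.0, 4.0):
        m, q, M = solve_rs(alpha, beta=1.0, beta_s=1.0)
        print(alpha, round(m, 4), round(q, 4), "bit-error rate", (1.0 - M) / 2.0)
```

At $\beta = \beta_s$ the solution satisfies m = q, so that E = F, and the resulting overlap exceeds the conventional value (5.124) whenever q > 0.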
FIG. 5.14. Bit-error rate of the CDMA demodulators. The left one (a) is for
the noise power $\beta_s = 1$ and the right (b) is for $\beta_s = 20$. The symbols are:
Opt. for the MPM, MFA for the mean-field demodulator with $\beta = \beta_s$, and
CD for the conventional demodulator (T. Tanaka 2001; Copyright 2001 by
the Massachusetts Institute of Technology).


5.9.4 Performance comparison 

The results of the previous analysis in terms of the bit-error rate $(1 - M)/2$
are plotted in Fig. 5.14 for (a) $\beta_s = 1$ and (b) $\beta_s = 20$ for the conventional
demodulator (CD), MPM (‘Opt.’), and MAP demodulators. Also shown is the
mean-field demodulator in which one uses the mean-field equation of state for
local magnetization

$$m_i = \tanh\Big\{\beta\Big(-\sum_j J_{ij}m_j + h_i\Big)\Big\} \tag{5.148}$$

in combination with $\hat{\xi}_i = \mathrm{sgn}(m_i)$. This method has the advantage that it serves
as a demodulating algorithm of direct practical usefulness.
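The mean-field demodulator (5.148) can be sketched in a few lines. The code below is an illustration under explicit assumptions, not the algorithm of the cited work: the couplings are taken as $J_{ij} = N^{-1}\sum_t \eta_i^t\eta_j^t$ with the diagonal terms dropped (they only add a constant to the Hamiltonian since $S_i^2 = 1$), the update is damped and synchronous, and the helper names are hypothetical.

```python
import math
import random


def make_instance(n_users, n_chips, sigma, rng):
    """Random signals, spreading codes, and received signal of eq. (5.120)."""
    xi = [rng.choice((-1, 1)) for _ in range(n_users)]
    eta = [[rng.choice((-1, 1)) for _ in range(n_users)] for _ in range(n_chips)]
    y = [sum(e * x for e, x in zip(row, xi)) + rng.gauss(0.0, sigma) for row in eta]
    return xi, eta, y


def mean_field_demodulate(eta, y, beta, sweeps=200, damping=0.5):
    """Iterate m_i = tanh{beta(-sum_j J_ij m_j + h_i)}, eq. (5.148)."""
    p, n = len(eta), len(eta[0])
    h = [sum(eta[t][i] * y[t] for t in range(p)) / n for i in range(n)]
    # assumption: diagonal couplings are omitted
    J = [[sum(eta[t][i] * eta[t][j] for t in range(p)) / n if i != j else 0.0
          for j in range(n)] for i in range(n)]
    m = [0.0] * n
    for _ in range(sweeps):
        new = [math.tanh(beta * (h[i] - sum(J[i][j] * m[j] for j in range(n))))
               for i in range(n)]
        m = [(1.0 - damping) * a + damping * b for a, b in zip(m, new)]
    decoded = [1 if v >= 0 else -1 for v in m]
    return decoded, m


if __name__ == "__main__":
    rng = random.Random(1)
    n_users, n_chips, beta_s = 50, 100, 20.0
    xi, eta, y = make_instance(n_users, n_chips, math.sqrt(n_users / beta_s), rng)
    decoded, _ = mean_field_demodulate(eta, y, beta=beta_s)
    print("overlap:", sum(a * b for a, b in zip(xi, decoded)) / n_users)
```

The damping factor is a practical choice that suppresses oscillations of the synchronous update; it does not change the fixed points of (5.148).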

It is observed that the MAP and MPM show much better performance than
the conventional demodulator. The curve for the MAP almost overlaps with the
MPM curve when the noise power is high, $\beta_s = 1$, but a clear deviation is found
in the low-noise case $\beta_s = 20$. The MPM result has been confirmed to be stable
against RSB. By contrast, one should take RSB into account for the MAP except in
a region with small $\alpha$ and large $\beta_s$.


Bibliographical note 


General expositions of information theory and error-correcting codes are found 
in textbooks on these subjects (McEliece 1977; Clark and Cain 1981; Lin and 
Costello 1983; Arazi 1988; Rhee 1989; Ash 1990; Wicker 1995). The present 
form of statistical-mechanical analysis of error-correcting codes was proposed by 
Sourlas (1989) and has been expanding rapidly as described in the text. Some 
of the recent papers along this line of development (but not cited in the text)
include Kanter and Saad (2000), Nakamura et al. (2000), and Kabashima et al. 
(2000c). See also Heegard and Wicker (1999) for the turbo code. 


6 
IMAGE RESTORATION 


The problem of statistical inference of the original image given a noisy image can 
be formulated in a similar way to error-correcting codes. By the Bayes formula 
the problem reduces to a form of random spin systems, and methods of statis- 
tical mechanics apply. It will be shown that image restoration using statistical 
fluctuations (finite-temperature restoration or MPM) gives better performance 
than the MAP if we are to maximize the pixel-wise similarity of the restored 
image to the original image. This is the same situation as in error-correcting 
codes. Mean-field treatments and the problem of parameter estimation will also 
be discussed. 


6.1 Stochastic approach to image restoration 


Let us consider the problem of inference of the original image from a given digital 
image corrupted by noise. This problem would seem to be very difficult without 
any hints about which part has been corrupted by the noise. In the stochastic 
approach to image restoration, therefore, one usually makes use of empirical 
knowledge on images in general (a priori knowledge) to facilitate reasonable 
restoration. The Bayes formula plays an important role in the argument. 


6.1.1 Binary image and Bayesian inference 
We formulate the stochastic method of image restoration for the simple case of
a binary (‘black and white’) image represented by a set of Ising spins $\xi = \{\xi_i\}$.
The index i denotes a lattice site in the spin system and corresponds to the pixel
index of an image. The set of pixel states $\xi$ is called the Markov random field in
the literature of image restoration.

Suppose that the image is corrupted by noise, and one receives a degraded
(corrupted) image with the state of the pixel $\tau_i$ inverted from the original value
$\xi_i$ with probability p. This conditional probability is written as

$$P(\tau_i|\xi_i) = \frac{\exp(\beta_p\tau_i\xi_i)}{2\cosh\beta_p}, \tag{6.1}$$

where $\beta_p$ is the same function of p as in (5.7). Under the assumption of
independent noise at each pixel, the conditional probability for the whole image is the
product of (6.1):

$$P(\tau|\xi) = \frac{1}{(2\cosh\beta_p)^N}\exp\Big(\beta_p\sum_i \tau_i\xi_i\Big), \tag{6.2}$$

where N is the total number of pixels.
The problem is to infer the original image $\xi$, given a degraded image $\tau$. For
this purpose, it is useful to use the Bayes formula (5.10) to exchange the entries
$\tau$ and $\xi$ in the conditional probability (6.2). We use the notation $\sigma = \{\sigma_i\}$ for
dynamical variables to restore the image which are to be distinguished from the
true original image $\xi$. Then the desired conditional probability (posterior) is

$$P(\sigma|\tau) = \frac{\exp(\beta_p\sum_i \tau_i\sigma_i)\,P(\sigma)}{\mathrm{Tr}_\sigma \exp(\beta_p\sum_i \tau_i\sigma_i)\,P(\sigma)}. \tag{6.3}$$

Here the original image is assumed to have been generated with the probability
(prior) $P(\sigma)$.

One usually does not know the correct prior $P(\sigma)$. Nevertheless (6.3) shows
that it is necessary to use $P(\sigma)$ in addition to the given degraded image $\tau$ to
restore the original image. In error-correcting codes, it was reasonable to assume 
a uniform prior. This is not the case in image restoration where non-trivial 
structures (such as local smoothness) are essential. We therefore rely on our 
knowledge on images in general to construct a model prior to be used in place 
of the true prior. 

Let us consider a degraded image in which a black pixel is surrounded by
white pixels. It then seems natural to infer that the black pixel is more likely to have
been caused by noise than to have existed in the original image because real
images often have extended areas of smooth parts. This leads us to the following
model prior that gives a larger probability to neighbouring pixels in the same
state than in different states:

$$P(\sigma) = \frac{\exp(\beta_m\sum_{\langle ij\rangle}\sigma_i\sigma_j)}{Z(\beta_m)}, \tag{6.4}$$

where the sum over $\langle ij\rangle$ runs over neighbouring pixels. The normalization factor
$Z(\beta_m)$ is the partition function of the ferromagnetic Ising model at temperature
$T_m = 1/\beta_m$. Equation (6.4) represents our general knowledge that meaningful
images usually tend to have large areas of smooth parts rather than rapidly
changing parts. The parameter $\beta_m$ controls smoothness: a larger $\beta_m$ means
a larger probability of the same state for neighbouring pixels.


6.1.2 MAP and MPM 


With the model prior (6.4) inserted in the Bayes formula (6.3), we have the
explicit form of the posterior,

$$P(\sigma|\tau) = \frac{\exp(\beta_p\sum_i \tau_i\sigma_i + \beta_m\sum_{\langle ij\rangle}\sigma_i\sigma_j)}{\mathrm{Tr}_\sigma \exp(\beta_p\sum_i \tau_i\sigma_i + \beta_m\sum_{\langle ij\rangle}\sigma_i\sigma_j)}. \tag{6.5}$$

The numerator is the Boltzmann factor of an Ising ferromagnet in random fields
represented by $\tau$. We have thus reduced the problem of image restoration to the
statistical mechanics of a random-field Ising model.
If one follows the idea of MAP, one should look for the ground state of the
random-field Ising model because the ground state maximizes the Boltzmann
factor (6.5). Note that the set $\tau$, the degraded image, is given and fixed, which
in other words represents quenched randomness. Another strategy (MPM) is to
minimize the pixel-wise error probability as described in §5.2.3 and accept $\mathrm{sgn}\langle\sigma_i\rangle$
as the restored value of the ith pixel calculated through the finite-temperature
expectation value. It should also be noted here that, in practical situations of
restoration of grey-scale natural images, one often uses multivalued spin systems,
which will be discussed in §§6.4 and 6.5.
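The two strategies can be made concrete on a system small enough for exhaustive enumeration. The sketch below is only an illustration: it assumes a one-dimensional chain with free ends in place of a two-dimensional image, writes h for the coefficient of the data term in (6.5), and uses hypothetical function names.

```python
import itertools
import math


def posterior(tau, h, beta_m):
    """All configurations of a short chain with their posterior weights (6.5)."""
    n = len(tau)
    states, weights = [], []
    for sigma in itertools.product((-1, 1), repeat=n):
        field = sum(t * s for t, s in zip(tau, sigma))
        bonds = sum(sigma[i] * sigma[i + 1] for i in range(n - 1))
        states.append(sigma)
        weights.append(math.exp(h * field + beta_m * bonds))
    z = sum(weights)
    return states, [w / z for w in weights]


def restore_map(tau, h, beta_m):
    """MAP: the single configuration with the largest posterior weight."""
    states, probs = posterior(tau, h, beta_m)
    best = max(range(len(states)), key=probs.__getitem__)
    return list(states[best])


def restore_mpm(tau, h, beta_m):
    """MPM: the sign of the thermal average of each pixel."""
    states, probs = posterior(tau, h, beta_m)
    means = [sum(p * s[i] for s, p in zip(states, probs)) for i in range(len(tau))]
    return [1 if m >= 0 else -1 for m in means]


if __name__ == "__main__":
    degraded = [1, 1, 1, -1, 1, 1, 1]  # one isolated 'black' pixel
    print("MAP:", restore_map(degraded, h=0.5, beta_m=1.5))
    print("MPM:", restore_mpm(degraded, h=0.5, beta_m=1.5))
```

The enumeration costs $2^N$ operations and is therefore limited to very small N; for realistic images one resorts to Monte Carlo sampling or to the mean-field methods discussed later in this chapter.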


6.1.3 Overlap 


The parameter $\beta_p$ in (6.5) represents the noise rate in the degraded image. One
does not know this noise rate beforehand, so that it makes sense to replace it
with a general variable h to be estimated by some method. We therefore use the
posterior (6.5) with $\beta_p$ replaced by h. Our theoretical analysis will be developed
for a while for the case where the original image has been generated according
to the Boltzmann factor of the ferromagnetic Ising model:

$$P(\xi) = \frac{\exp(\beta_s\sum_{\langle ij\rangle}\xi_i\xi_j)}{Z(\beta_s)}, \tag{6.6}$$

where $\beta_s$ is the inverse of the temperature $T_s$ of the prior.
We next define the average overlap of the original and restored images as in
(5.19)

$$\begin{aligned}
M(\beta_m, h) &= \mathrm{Tr}_\xi \mathrm{Tr}_\tau P(\xi)P(\tau|\xi)\{\xi_i\,\mathrm{sgn}\langle\sigma_i\rangle\} \\
&= \frac{1}{(2\cosh\beta_p)^N Z(\beta_s)}\, \mathrm{Tr}_\xi \mathrm{Tr}_\tau \exp\Big(\beta_s\sum_{\langle ij\rangle}\xi_i\xi_j + \beta_p\sum_i \tau_i\xi_i\Big)\{\xi_i\,\mathrm{sgn}\langle\sigma_i\rangle\}.
\end{aligned} \tag{6.7}$$

Here $\langle\sigma_i\rangle$ is the average by the Boltzmann factor with $\beta_p$ replaced by h in (6.5).
The dependence of M on $\beta_m$ and h is in the quantity $\mathrm{sgn}\langle\sigma_i\rangle$. The overlap
$M(\beta_m, h)$ assumes the largest value when $\beta_m$ and h are equal to the true values,
$\beta_s$ and $\beta_p$, respectively:

$$M(\beta_m, h) \le M(\beta_s, \beta_p). \tag{6.8}$$

This inequality can be proved in the same way as in §5.3.2.

The inequality (6.8) has been derived for the artificial image generated by 
the Ising model prior (6.6). It is not possible to prove comparable results for 
general natural images because the prior is different from one image to another. 
Nevertheless it may well happen in many images that the maximization of
pixel-wise overlap is achieved at finite values of the parameters $\beta_m$ and h.
We have been discussing noise of type (6.2), a simple reversal of the binary
value. Similar arguments can be developed for the Gaussian noise

$$P(\tau|\xi) = \frac{1}{(\sqrt{2\pi}\,\tau)^N}\exp\left\{-\sum_i \frac{(\tau_i - \tau_0\xi_i)^2}{2\tau^2}\right\}. \tag{6.9}$$

The inequality for the maximum overlap between pixels, corresponding to (6.8),
is then

$$M(\beta_m, h) \le M\Big(\beta_s, \frac{\tau_0}{\tau^2}\Big). \tag{6.10}$$


6.2 Infinite-range model 


The true values of the parameters $\beta_m$ and h ($\beta_s$ and $\beta_p$, respectively) are not
known beforehand. One should estimate them to bring the overlap M close to
the largest possible value. It is therefore useful to have information on how the
overlap $M(\beta_m, h)$ depends upon the parameters near the best values $\beta_m = \beta_s$ and
$h = \beta_p$. The infinite-range model serves as a prototype to clarify this point. In the
present section we calculate the overlap for the infinite-range model (Nishimori 
and Wong 1999). 


6.2.1 Replica calculations 


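Let us consider the infinite-range model with the following priors (the real and
model priors):

$$P(\xi) = \frac{\exp\{(\beta_s/2N)\sum_{i\ne j}\xi_i\xi_j\}}{Z(\beta_s)}, \qquad P(\sigma) = \frac{\exp\{(\beta_m/2N)\sum_{i\ne j}\sigma_i\sigma_j\}}{Z(\beta_m)}. \tag{6.11}$$
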
This model is very artificial in the sense that all pixels are neighbours to each 
other, and it cannot be used to restore the original image of a realistic two- 
dimensional degraded image. However, our aim here is not to establish a model 
of practical usefulness but to understand the generic features of macroscopic 
variables such as the overlap M. It is well established in statistical mechanics 
that the infinite-range model is suited for such a purpose. 

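For the Gaussian noise (6.9), we can calculate the overlap $M(\beta_m, h)$ by the
replica method. The first step is the evaluation of the configurational average of
the nth power of the partition function:

$$\begin{aligned}
[Z^n] &= \int \prod_i d\tau_i\, \frac{1}{(\sqrt{2\pi}\,\tau)^N} \exp\Big\{-\frac{1}{2\tau^2}\sum_i(\tau_i^2 + \tau_0^2)\Big\}\, \frac{\mathrm{Tr}_\xi \exp\{(\beta_s/2N)\sum_{i\ne j}\xi_i\xi_j + (\tau_0/\tau^2)\sum_i \tau_i\xi_i\}}{Z(\beta_s)} \\
&\quad\cdot \mathrm{Tr}_\sigma \exp\Big(\frac{\beta_m}{2N}\sum_{i\ne j}\sum_\alpha \sigma_i^\alpha\sigma_j^\alpha + h\sum_i\sum_\alpha \tau_i\sigma_i^\alpha\Big) \\
&= \frac{1}{Z(\beta_s)(\sqrt{2\pi}\,\tau)^N}\int \prod_i d\tau_i\, \exp\Big\{-\frac{1}{2\tau^2}\sum_i(\tau_i^2 + \tau_0^2)\Big\} \Big(\frac{N\beta_s}{2\pi}\Big)^{1/2}\Big(\frac{N\beta_m}{2\pi}\Big)^{n/2} \int dm_0 \prod_\alpha dm_\alpha\, \mathrm{Tr}_\xi \mathrm{Tr}_\sigma \\
&\quad\cdot \exp\Big[-\frac{N\beta_s m_0^2}{2} - \frac{\beta_s}{2} - \frac{n\beta_m}{2} - \frac{N\beta_m}{2}\sum_\alpha m_\alpha^2 + \beta_s m_0\sum_i \xi_i + \beta_m\sum_\alpha m_\alpha\sum_i \sigma_i^\alpha + \frac{\tau_0}{\tau^2}\sum_i \tau_i\xi_i + h\sum_i\sum_\alpha \tau_i\sigma_i^\alpha\Big] \\
&\propto \frac{1}{Z(\beta_s)}\int dm_0 \prod_\alpha dm_\alpha\, \exp N\Big[-\frac{\beta_s m_0^2}{2} - \frac{\beta_m}{2}\sum_\alpha m_\alpha^2 \\
&\quad + \log \mathrm{Tr}\int Du\, \exp\Big(\beta_s m_0\xi + \beta_m\sum_\alpha m_\alpha\sigma^\alpha + \tau_0 h\xi\sum_\alpha \sigma^\alpha + h\tau u\sum_\alpha \sigma^\alpha\Big)\Big],
\end{aligned} \tag{6.12}$$

where Tr denotes the sums over $\sigma^\alpha$ and $\xi$. We write $[Z^n] = \exp(-\beta_m nNf)$,
and evaluate $-\beta_m nf$ to first order in n by steepest descent assuming replica
symmetry,

$$-\beta_m nf = -\frac{1}{2}\beta_s m_0^2 + \log 2\cosh\beta_s m_0 - \frac{n}{2}\beta_m m^2 + n\,\frac{\mathrm{Tr}_\xi \int Du\, e^{\beta_s m_0\xi}\log 2\cosh(\beta_m m + \tau_0 h\xi + \tau hu)}{2\cosh\beta_s m_0}, \tag{6.13}$$

where $\mathrm{Tr}_\xi$ is the sum over $\xi = \pm 1$.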
By extremization of the free energy at each order of n, we obtain the equations
of state for order parameters. From the n-independent terms, we find

$$m_0 = \tanh\beta_s m_0. \tag{6.14}$$

This represents the ferromagnetic order parameter $m_0 = [\xi_i]$ in the original
image. It is natural to have a closed equation for $m_0$ because the original image
should not be affected by degraded or restored images.

The terms of O(n) give

$$m = \frac{\mathrm{Tr}_\xi \int Du\, e^{\beta_s m_0\xi}\tanh(\beta_m m + \tau_0 h\xi + \tau hu)}{2\cosh\beta_s m_0}. \tag{6.15}$$

This is the equation for the ferromagnetic order parameter $m = [\langle\sigma_i\rangle]$ of the
restored image. The overlap M can be calculated by replacing $\tanh(\cdot)$ in (6.15)
by $\xi\,\mathrm{sgn}(\cdot)$ as in §5.4.4. Here, we should remember that we cannot use the
ferromagnetic gauge in image restoration (because the prior is not a constant) and
the value $\xi$ remains explicitly in the formulae.

$$M = \frac{\mathrm{Tr}_\xi \int Du\, e^{\beta_s m_0\xi}\,\xi\,\mathrm{sgn}(\beta_m m + \tau_0 h\xi + \tau hu)}{2\cosh\beta_s m_0}. \tag{6.16}$$

The information on the original image (6.14) determines the order parameter of
the restored image (6.15) and then we have the overlap (6.16).

FIG. 6.1. The overlap as a function of the restoration temperature


6.2.2 Temperature dependence of the overlap 


It is straightforward to investigate the temperature dependence of M by
numerically solving the equations for $m_0$, m, and M in (6.14), (6.15), and (6.16). In
Fig. 6.1 we have drawn M as a function of $T_m = 1/\beta_m$ by fixing the ratio of
$\beta_m$ and h to the optimum value $\beta_s/(\tau_0/\tau^2)$ determined in (6.10). We have set
$T_s = 0.9$, $\tau_0 = \tau = 1$. The overlap is seen to be a maximum at the optimal
parameter $T_m = 0.9$ ($= T_s$). The MAP corresponds to $T_m \to 0$ and the overlap
there is smaller than the maximum value. It is clear that the annealing process 
(in which one tries to reach equilibrium by decreasing the temperature from a 
high value) gives a smaller overlap if one lowers the temperature beyond the 
optimal value. 
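The numerical solution mentioned above may be sketched as follows, with the parameter values $T_s = 0.9$, $\tau_0 = \tau = 1$ quoted in the text and the ratio of $\beta_m$ to h held at $\beta_s/(\tau_0/\tau^2)$. The quadrature rule, the simple fixed-point iteration, and the function names are assumptions of this illustration; the Gaussian average of the sign function in (6.16) is written in closed form with the error function.

```python
import math


def gauss_average(f, u_max=8.0, n=801):
    """Trapezoidal estimate of the Gaussian average of f(u) with measure Du."""
    du = 2.0 * u_max / (n - 1)
    total = 0.0
    for k in range(n):
        u = -u_max + k * du
        weight = 0.5 if k in (0, n - 1) else 1.0
        total += weight * math.exp(-0.5 * u * u) * f(u)
    return total * du / math.sqrt(2.0 * math.pi)


def solve_m0(beta_s, iters=2000):
    """Positive solution of m0 = tanh(beta_s m0), eq. (6.14)."""
    m0 = 1.0
    for _ in range(iters):
        m0 = math.tanh(beta_s * m0)
    return m0


def overlap(t_m, t_s=0.9, tau0=1.0, tau=1.0):
    """Returns (m0, m, M) from (6.14)-(6.16) at restoration temperature t_m."""
    beta_s, beta_m = 1.0 / t_s, 1.0 / t_m
    h = beta_m * (tau0 / tau ** 2) / beta_s  # keeps beta_m / h at its optimum ratio
    m0 = solve_m0(beta_s)
    norm = 2.0 * math.cosh(beta_s * m0)
    weight = {xi: math.exp(beta_s * m0 * xi) / norm for xi in (-1, 1)}
    m = m0
    for _ in range(500):
        m_new = sum(
            weight[xi] * gauss_average(
                lambda u, xi=xi: math.tanh(beta_m * m + tau0 * h * xi + tau * h * u))
            for xi in (-1, 1))
        converged = abs(m_new - m) < 1e-12
        m = m_new
        if converged:
            break
    # Gaussian average of sgn(a + tau*h*u) equals erf(a / (sqrt(2)*tau*h))
    big_m = sum(
        weight[xi] * xi * math.erf((beta_m * m + tau0 * h * xi) / (math.sqrt(2.0) * tau * h))
        for xi in (-1, 1))
    return m0, m, big_m


if __name__ == "__main__":
    for t_m in (0.3, 0.6, 0.9, 1.2, 2.0):
        print(t_m, overlap(t_m)[2])
```

With these parameters the overlap depends on $T_m$ only through m, and it is largest when m coincides with $m_0$, which happens at $T_m = T_s$.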


6.3 Simulation 


It is in general difficult to discuss quantitatively the behaviour of the overlap M 
for two-dimensional images by the infinite-range model. We instead use Monte 
Carlo simulations and compare the results with those for the infinite-range model 
(Nishimori and Wong 1999). 

Figure 6.2 shows the overlap M of the original and restored images by
finite-temperature restoration. The original image has 400 × 400 pixels and was
generated by the prior (6.11) ($T_s = 2.15$). Degradation was caused by the binary
noise with p = 0.1. The overlap M assumes a maximum when the restoration
temperature $T_m$ is equal to the original $T_s = 2.15$ according to the inequality
(6.8), which is seen to be true within statistical uncertainties. In this example,
the parameter h has been changed with $\beta_m$ so that the ratio of $\beta_m$ and h is kept
to the optimum value $\beta_s/\beta_p$.

FIG. 6.2. The overlap as a function of the restoration temperature for a
two-dimensional image.

Comparison with the case of the infinite-range model in Fig. 6.1 indicates 
that M depends relatively mildly on the temperature in the two-dimensional 
case below the optimum value. One should, however, be aware that this result
is for the present specific values of $T_s$ and p, and it still has to be clarified how
general this conclusion is. 

An explicit illustration is given in Fig. 6.3 that corresponds to the situation
of Fig. 6.2. Figure 6.3(a) is the original image ($T_s = 2.15$), (b) is the degraded
image (p = 0.1), (c) is the result of restoration at a low temperature ($T_m = 0.5$),
and (d) has been obtained at the optimum temperature ($T_m = 2.15$). It is clear
that (d) is closer to the original image than (c). The MAP has $T_m = 0$ and is
expected to give an even less faithful restored image than (c), in particular in 
fine structures. It has thus become clear that the MPM, the finite-temperature 
restoration with correct parameter values, gives better results than the MAP for 
two-dimensional images generated by the ferromagnetic Ising prior (6.11). 


6.4 Mean-field annealing 


In practical implementations of image restoration by the MAP as well as by
the MPM, the required amount of computation is usually very large because
there are $2^N$ degrees of freedom for a binary image. Therefore one often makes
use of approximations, and a typical example is mean-field annealing in which
one looks for the optimum solution numerically using the idea of the mean-field
approximation (Geiger and Girosi 1991; Zhang 1992; Bilbro et al. 1992).

FIG. 6.3. Restoration of an image generated by the two-dimensional Ising model:
(a) original image, (b) degraded image, (c) restored image at a very low
temperature (close to MAP), and (d) restored image at the optimum temperature
(MPM).


6.4.1 Mean-field approximation 


We now generalize the argument from binary to grey-scale images to be
represented by the Potts model. Generalization of (6.5) to the Potts model is

$$P(\sigma|\tau) = \frac{\exp(-\beta_p H(\sigma|\tau))}{Z} \tag{6.17}$$

$$H(\sigma|\tau) = -\sum_i \delta(\sigma_i, \tau_i) - J\sum_{\langle ij\rangle}\delta(\sigma_i, \sigma_j) \tag{6.18}$$

$$Z = \mathrm{Tr}_\sigma \exp(-\beta_p H(\sigma|\tau)), \tag{6.19}$$

where $\tau$ and $\sigma$ are Q-state Potts spins ($\tau_i, \sigma_i = 0, 1, \dots, Q - 1$) to denote grey
scales of degraded and restored images, respectively. In the ferromagnetic Potts
model, (6.18) with J > 0, the interaction energy is $-J$ if the neighbouring spins
(pixels) are in the same state $\sigma_i = \sigma_j$ and zero otherwise. Thus the neighbouring
spins tend to be in the same state. The Ising model corresponds to Q = 2. The
MAP evaluates the ground state of (6.18), and the MPM calculates the thermal
average of each spin $\sigma_i$ at an appropriate temperature.

Since it is difficult to evaluate (6.17) explicitly, we approximate it by the
product of marginal distributions

$$\rho_i(n) = \mathrm{Tr}_\sigma P(\sigma|\tau)\,\delta(n, \sigma_i) \tag{6.20}$$

as

$$P(\sigma|\tau) \approx \prod_i \rho_i(\sigma_i). \tag{6.21}$$
The closed set of equations for $\rho_i$ can be derived by inserting the mean-field
approximation (6.21) into the free energy

$$F = \mathrm{Tr}_\sigma\{H(\sigma|\tau) + T_p\log P(\sigma|\tau)\}P(\sigma|\tau) \tag{6.22}$$

and minimizing it with respect to $\rho_i$ under the normalization condition

$$\mathrm{Tr}_\sigma \prod_i \rho_i(\sigma_i) = 1. \tag{6.23}$$

Simple manipulations then show that $\rho_i$ satisfies the following equation:

$$\rho_i(\sigma) = \frac{\exp(-\beta_p H_i^{\mathrm{MF}}(\sigma))}{\sum_{n=0}^{Q-1}\exp(-\beta_p H_i^{\mathrm{MF}}(n))} \tag{6.24}$$

$$H_i^{\mathrm{MF}}(n) = -\delta(n, \tau_i) - J\sum_{\mathrm{n.n.}\in i}\rho_j(n), \tag{6.25}$$

where the sum in the second term on the right hand side of (6.25) runs over
nearest neighbours of i.


6.4.2 Annealing 

A numerical solution of (6.24) can be obtained relatively straightforwardly by
iteration if the parameters $\beta_p$ and J are given. In practice, one iterates not for
the function $\rho_i$ itself but for the coefficients $\{m_i^{(l)}\}$

$$\rho_i(\sigma) = \sum_{l=0}^{Q-1} m_i^{(l)}\Phi_l(\sigma) \tag{6.26}$$

of the expansion of the function in terms of the complete orthonormal system of
polynomials

$$\sum_{\sigma=0}^{Q-1}\Phi_l(\sigma)\Phi_{l'}(\sigma) = \delta(l, l'). \tag{6.27}$$

The following discrete Tchebycheff polynomials are useful for this purpose (Tanaka
and Morita 1996):

$$\begin{aligned}
&\Psi_0(\sigma) = 1, \qquad \Psi_1(\sigma) = 1 - \frac{2\sigma}{Q-1}, \\
&(l+1)(Q-1-l)\Psi_{l+1}(\sigma) = -(2\sigma - Q + 1)(2l+1)\Psi_l(\sigma) - l(Q+l)\Psi_{l-1}(\sigma), \\
&\Phi_l(\sigma) = \frac{\Psi_l(\sigma)}{\sqrt{\sum_{\sigma=0}^{Q-1}\Psi_l(\sigma)^2}}.
\end{aligned} \tag{6.28}$$

Multiplying both sides of (6.24) by $\Phi_l(\sigma)$ and summing the result over $\sigma$, we find
from (6.27) and (6.26)

$$m_i^{(l)} = \frac{\mathrm{Tr}_\sigma \Phi_l(\sigma)\exp\{\beta_p\delta(\sigma, \tau_i) + \beta_p J\sum_{\mathrm{n.n.}\in i}\sum_{l'} m_j^{(l')}\Phi_{l'}(\sigma)\}}{Z_{\mathrm{MF}}}, \tag{6.29}$$

where $Z_{\mathrm{MF}}$ is the denominator of (6.24). The set of coefficients $\{m_i^{(l)}\}$ can thus be
calculated by iteration. In practice, one usually does not know the correct values
of $\beta_p$ and J, and therefore it is necessary to estimate them by the methods
explained below.
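The orthonormal system (6.27) can be generated directly from the three-term recurrence for the discrete Tchebycheff polynomials, $(l+1)(Q-1-l)\Psi_{l+1}(\sigma) = -(2\sigma-Q+1)(2l+1)\Psi_l(\sigma) - l(Q+l)\Psi_{l-1}(\sigma)$ with $\Psi_0 = 1$ and $\Psi_1 = 1 - 2\sigma/(Q-1)$. The following sketch tabulates $\Phi_l(\sigma)$ and converts between a distribution $\rho(\sigma)$ and its coefficients $m^{(l)}$ as in (6.26); the function names are hypothetical.

```python
import math


def tchebycheff_basis(n_states):
    """Table phi[l][sigma] of orthonormal discrete Tchebycheff polynomials."""
    q = n_states
    psi = [[1.0] * q]
    if q > 1:
        psi.append([1.0 - 2.0 * s / (q - 1) for s in range(q)])
    for l in range(1, q - 1):
        # three-term recurrence giving Psi_{l+1} from Psi_l and Psi_{l-1}
        nxt = [(-(2 * s - q + 1) * (2 * l + 1) * psi[l][s]
                - l * (q + l) * psi[l - 1][s]) / ((l + 1) * (q - 1 - l))
               for s in range(q)]
        psi.append(nxt)
    phi = []
    for row in psi:
        norm = math.sqrt(sum(v * v for v in row))
        phi.append([v / norm for v in row])
    return phi


def expand(rho):
    """Coefficients m^(l) = sum_sigma Phi_l(sigma) rho(sigma)."""
    phi = tchebycheff_basis(len(rho))
    return [sum(p * r for p, r in zip(row, rho)) for row in phi]


def reconstruct(coeffs):
    """rho(sigma) = sum_l m^(l) Phi_l(sigma), eq. (6.26)."""
    phi = tchebycheff_basis(len(coeffs))
    return [sum(coeffs[l] * phi[l][s] for l in range(len(coeffs)))
            for s in range(len(coeffs))]


if __name__ == "__main__":
    rho = [0.1, 0.2, 0.3, 0.4]
    print(expand(rho))
    print(reconstruct(expand(rho)))
```

For large Q the recurrence may lose accuracy in floating-point arithmetic, so that a check of (6.27) on the tabulated values is advisable before use.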

Equation (6.29) is a generalization of the usual mean-field approximation to
the Potts model. To confirm this explicitly, we apply (6.24) and (6.25) to the
Ising model (Q = 2):

$$\rho_i(\sigma) = m_i^{(0)} + m_i^{(1)}(1 - 2\sigma) \tag{6.30}$$

$$H_i^{\mathrm{MF}}(\sigma) = -\delta(\sigma, \tau_i) - J\sum_{\mathrm{n.n.}\in i}\{m_j^{(0)} + m_j^{(1)}(1 - 2\sigma)\}, \tag{6.31}$$

where $\sigma = 0$ or 1. Using the first two Tchebycheff polynomials $\Psi_0$ and $\Psi_1$, we
find from (6.29) that $m_i^{(0)} = 1$ and the mean-field equation in a familiar form

$$m_i^{(1)} = \tanh\Big(\beta_p J\sum_{\mathrm{n.n.}\in i} m_j^{(1)} + \beta_p\tau_i\Big), \tag{6.32}$$

where we have used the conventional Ising variables ($\pm 1$ instead of 0 and 1).

In the MAP ($\beta_p \to \infty$) as well as in the MPM, one has to lower the
temperature gradually starting from a sufficiently high temperature ($\beta_p \approx 0$) to obtain
a reliable solution of (6.29). This is the process of mean-field annealing.
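The annealing process can be illustrated on a small Potts image. The sketch below iterates the marginals $\rho_i(n)$ of (6.24) and (6.25) directly, rather than the expansion coefficients, and raises $\beta_p$ step by step. The square grid with free boundaries, the synchronous update, the annealing schedule, and the function names are all assumptions made for this illustration.

```python
import math


def neighbours(i, j, rows, cols):
    """Nearest neighbours of pixel (i, j) on a grid with free boundaries."""
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        a, b = i + di, j + dj
        if 0 <= a < rows and 0 <= b < cols:
            yield a, b


def mean_field_step(rho, tau, n_states, coupling, beta):
    """One synchronous update of all marginals by (6.24) with (6.25)."""
    rows, cols = len(tau), len(tau[0])
    new = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # minus the mean-field energy (6.25) for each grey level n
            gain = [(1.0 if n == tau[i][j] else 0.0)
                    + coupling * sum(rho[a][b][n]
                                     for a, b in neighbours(i, j, rows, cols))
                    for n in range(n_states)]
            top = max(gain)  # subtracted for numerical stability
            w = [math.exp(beta * (g - top)) for g in gain]
            z = sum(w)
            row.append([v / z for v in w])
        new.append(row)
    return new


def mean_field_anneal(tau, n_states, coupling, betas, sweeps=20):
    """Anneal through the inverse temperatures in betas; return restored image."""
    rows, cols = len(tau), len(tau[0])
    rho = [[[1.0 / n_states] * n_states for _ in range(cols)] for _ in range(rows)]
    for beta in betas:
        for _ in range(sweeps):
            rho = mean_field_step(rho, tau, n_states, coupling, beta)
    return [[max(range(n_states), key=rho[i][j].__getitem__) for j in range(cols)]
            for i in range(rows)]


if __name__ == "__main__":
    degraded = [[2] * 5 for _ in range(5)]
    degraded[2][2] = 0  # one corrupted pixel in a uniform patch
    for row in mean_field_anneal(degraded, 4, 1.0, [0.5, 1.0, 2.0, 4.0]):
        print(row)
```

Each restored pixel is taken as the grey level with the largest marginal at the final temperature.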


6.5 Edges 


For non-binary images, it is useful to introduce variables representing discontin- 
uous changes of pixel values between neighbouring positions to restore the edges 
of surfaces as faithfully as possible (Geman and Geman 1984; Marroquin et al. 
1987). Such an edge variable u,; takes the value 0 (no edge between pixels i and 
j) or 1 (existence of an edge). In the present section, we solve a Gaussian model 
of image restoration with edges using the mean-field approximation (Geiger and 
Girosi 1991; Zerubia and Chellappa 1993; Zhang 1996; K. Tanaka 20010). 

Let us consider a Q-state grey-scale image. The model prior is assumed to
have a Gaussian form

$$P(\sigma, u) = \frac{1}{Z}\exp\Big[-\beta_m\sum_{\langle ij\rangle}(1 - u_{ij})\{(\sigma_i - \sigma_j)^2 - \gamma^2\}\Big], \tag{6.33}$$

where $\sigma_i = 0, 1, \dots, Q - 1$ and $u_{ij} = 0, 1$. Note that we are considering a
Q-state Ising model here, which is different from the Potts model. The difference
between neighbouring pixel values $|\sigma_i - \sigma_j|$ is constrained to be small (less than
$\gamma$) if $u_{ij} = 0$ (no edge) to reflect the smoothness of the grey scale, whereas the
same difference $|\sigma_i - \sigma_j|$ can take arbitrary values if $u_{ij} = 1$ (edge). Thus this
prior favours the existence of an edge if the neighbouring pixel values differ by
a large amount.¹² Noise is also supposed to be Gaussian

$$P(\tau|\xi) = \prod_i \frac{1}{\sqrt{2\pi}\,w}\exp\left\{-\frac{(\tau_i - \xi_i)^2}{2w^2}\right\}, \tag{6.34}$$

where the true original image $\xi$ and degraded image $\tau$ both have Q values at
each pixel. The posterior is therefore of the form

$$P(\sigma, u|\tau) = \frac{\exp(-H(\sigma, u|\tau))}{\sum_u \mathrm{Tr}_\sigma \exp(-H(\sigma, u|\tau))}, \tag{6.35}$$

where

$$H(\sigma, u|\tau) = \beta_m\sum_{\langle ij\rangle}(1 - u_{ij})\{(\sigma_i - \sigma_j)^2 - \gamma^2\} + (2w^2)^{-1}\sum_i(\tau_i - \sigma_i)^2. \tag{6.36}$$

In the finite-temperature (MPM) estimation, we accept the value $n_i$ that
maximizes the marginalized posterior

$$P(n_i|\tau) = \sum_u \mathrm{Tr}_\sigma P(\sigma, u|\tau)\,\delta(\sigma_i, n_i) \tag{6.37}$$

as the restored pixel state at i.

It is usually quite difficult to carry out the above procedure explicitly. A
convenient yet powerful approximation is the mean-field method discussed in
the previous section. The central quantities in the mean-field approximation are
the marginal probabilities

$$\begin{aligned}
\rho_i(n) &= \sum_u \mathrm{Tr}_\sigma P(\sigma, u|\tau)\,\delta(\sigma_i, n) \\
\rho_{ij}(u) &= \sum_u \mathrm{Tr}_\sigma P(\sigma, u|\tau)\,\delta(u_{ij}, u).
\end{aligned} \tag{6.38}$$

The full probability distribution is approximated as

$$P(\sigma, u|\tau) \approx \prod_i \rho_i(\sigma_i)\prod_{\langle ij\rangle}\rho_{ij}(u_{ij}). \tag{6.39}$$

The marginal probabilities are determined by minimization of the free energy

$$F = \sum_u \mathrm{Tr}_\sigma\{H(\sigma, u|\tau) + \log P(\sigma, u|\tau)\}P(\sigma, u|\tau) \tag{6.40}$$

with respect to $\rho_i(\sigma_i)$ and $\rho_{ij}(u_{ij})$ under the normalization condition

$$\sum_{n=0}^{Q-1}\rho_i(n) = 1, \qquad \sum_{u=0,1}\rho_{ij}(u) = 1. \tag{6.41}$$

The result is

$$\begin{aligned}
\rho_i(n) &= \frac{e^{-E_i(n)}}{\sum_{m=0}^{Q-1}e^{-E_i(m)}}, \qquad \rho_{ij}(l) = \frac{e^{-E_{ij}(l)}}{\sum_{k=0,1}e^{-E_{ij}(k)}} \\
E_i(n) &= \frac{(n - \tau_i)^2}{2w^2} + \sum_{j\in G_i}\sum_{k=0,1}\sum_{m=0}^{Q-1}\beta_m(1 - k)\{(n - m)^2 - \gamma^2\}\rho_j(m)\rho_{ij}(k) \\
E_{ij}(l) &= \beta_m\sum_{m,m'=0}^{Q-1}(1 - l)\{(m - m')^2 - \gamma^2\}\rho_i(m)\rho_j(m'),
\end{aligned} \tag{6.42}$$

where $G_i$ is the set of neighbours of i. By solving these equations iteratively, we
obtain $\rho_i(n)$, from which it is possible to determine the restored value of the ith
pixel as the n that gives the largest value of $\rho_i(n)$.

12 Interactions between edges represented by the products of the $u_{ij}$ are often included to take
into account various types of straight and crossing edges of extended lengths in real images.
The set $\{u_{ij}\}$ is called the line field in such cases.

For practical implementation of the iterative solution of the set of equations
(6.42) for large Q, it is convenient to approximate the sum over Q pixel values
$\sum_{m=0}^{Q-1}$ by the integral $\int_{-\infty}^{\infty}dm$ because the integrals are analytically calculated
to give

$$\rho_i(n) = \frac{1}{\sqrt{2\pi w_i^2}}\exp\left(-\frac{(n - \mu_i)^2}{2w_i^2}\right) \tag{6.43}$$

$$\mu_i = \frac{(2w^2)^{-1}\tau_i + \beta_m\sum_{j\in G_i}(1 - \lambda_{ij})\mu_j}{(2w^2)^{-1} + \beta_m\sum_{j\in G_i}(1 - \lambda_{ij})} \tag{6.44}$$

$$\frac{1}{2w_i^2} = \frac{1}{2w^2} + \beta_m\sum_{j\in G_i}(1 - \lambda_{ij}) \tag{6.45}$$

$$\lambda_{ij} = \frac{1}{1 + \exp[-\beta_m\{w_i^2 + w_j^2 + (\mu_i - \mu_j)^2 - \gamma^2\}]}. \tag{6.46}$$

It is straightforward to solve (6.44), (6.45), and (6.46) by iteration. Using the
result in (6.43), the final estimation of the restored pixel value is obtained.
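The iteration can be sketched for a one-dimensional signal, where each pixel has at most two neighbours. The update formulas used in the code are the Gaussian mean-field equations for the mean $\mu_i$, the variance $w_i^2$, and the edge probability $\lambda_{ij}$, restated in the comments; the one-dimensional geometry, the synchronous update order, the parameter values, and the function name are assumptions of this illustration.

```python
import math


def restore_with_edges(tau, w2, beta_m, gamma, sweeps=50):
    """Mean-field restoration of a 1D signal with edge variables.

    tau    : degraded pixel values
    w2     : noise variance w^2
    Returns (mu, var, lam), where lam[i] refers to the bond (i, i+1).
    """
    n = len(tau)
    mu = [float(t) for t in tau]
    var = [w2] * n  # w_i^2
    lam = [0.0] * (n - 1)
    for _ in range(sweeps):
        # lambda_ij = 1 / (1 + exp[-beta_m{w_i^2 + w_j^2 + (mu_i - mu_j)^2 - gamma^2}])
        lam = [1.0 / (1.0 + math.exp(-beta_m * (var[i] + var[i + 1]
                                                + (mu[i] - mu[i + 1]) ** 2
                                                - gamma ** 2)))
               for i in range(n - 1)]
        new_mu, new_var = [], []
        for i in range(n):
            num = tau[i] / (2.0 * w2)
            den = 1.0 / (2.0 * w2)
            for j, bond in ((i - 1, i - 1), (i + 1, i)):
                if 0 <= j < n:
                    c = beta_m * (1.0 - lam[bond])
                    num += c * mu[j]
                    den += c
            new_mu.append(num / den)            # mean mu_i
            new_var.append(1.0 / (2.0 * den))   # from 1/(2 w_i^2) = den
        mu, var = new_mu, new_var
    return mu, var, lam


if __name__ == "__main__":
    signal = [10] * 5 + [50] * 5
    mu, var, lam = restore_with_edges(signal, w2=25.0, beta_m=0.01, gamma=10.0)
    print([round(v, 2) for v in mu])
    print([round(v, 3) for v in lam])
```

The restored grey level at pixel i is the integer nearest to $\mu_i$, since the Gaussian (6.43) is peaked there. For parameter values that make the exponent very large the exponential should be guarded against overflow.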

An example is shown in Fig. 6.4 for Q = 256. The original image (a) has
been degraded by Gaussian noise of vanishing mean and variance 900 into (b).
The image restored by the set of equations (6.43) to (6.46) is shown in (c)
together with a restored image (d) obtained by a more sophisticated cluster
variation method in which the correlation effects of neighbouring sites are taken
into account in the Bethe-like approximation (K. Tanaka 2001b). Even in the
mean-field level (c), discontinuous changes of pixel values (edges) around the
eyes are well reproduced. The edges would have been blurred without the
$u_{ij}$-term in (6.33).

FIG. 6.4. Restoration of 256 grey-scale image by the Gaussian prior and edges
(line process). Degradation was by Gaussian noise of vanishing mean and
variance 900: (a) original image, (b) degraded image, (c) restored image by
mean-field annealing, and (d) restored image by cluster variation method.
Courtesy of Kazuyuki Tanaka (copyright 2001).


6.6 Parameter estimation 


It is necessary to use appropriate values of the parameters $\beta_p$ and J to restore
the image using the posterior (6.17). However, one usually has only the degraded
image and no explicit knowledge of the degradation process characterized by
$\beta_p$ or the parameter J of the original image. We therefore have to estimate
these parameters (hyperparameters) from the degraded image only (Besag 1986;
Lakshmanan and Derin 1989; Pryce and Bruce 1995; Zhou et al. 1997; Molina
et al. 1999).

The following procedure is often used for this purpose. We first marginalize
the probability of the given degraded image $\tau$, erasing the original image $\xi$:


$$P(\tau|\beta_p, J) = \mathrm{Tr}_{\xi}\, P(\tau|\xi, \beta_p) P(\xi, J). \tag{6.47}$$


The above notation denotes the probability of degraded image $\tau$ given the
parameters $\beta_p$ and $J$. Since we know $\tau$, it is possible to estimate the parameters
$\beta_p$ and $J$ as the ones that maximize the marginalized likelihood function (6.47)
or the evidence. However, the computational requirement for the sum in (6.47)
is exponentially large, and one should resort to simulations or the mean-field
approximation to implement this idea.
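
The marginalization (6.47) can be illustrated on a toy problem. The following is a minimal sketch, not the procedure used for the figures in this chapter: it assumes binary pixels ($\xi_i = \pm 1$) on an open one-dimensional chain, a prior proportional to $\exp(J\sum_i \xi_i\xi_{i+1})$ and a noise channel proportional to $\exp(\beta_p\sum_i \tau_i\xi_i)$. The function names `evidence` and `estimate_hyperparameters` are hypothetical. The exhaustive sum over $2^N$ originals makes the exponential cost explicit.

```python
import itertools
import math

def evidence(tau, beta_p, J):
    """Marginal likelihood P(tau | beta_p, J) of (6.47) by exhaustive summation.

    Assumed toy model: binary pixels on an open chain.
    """
    n = len(tau)
    configs = list(itertools.product((-1, 1), repeat=n))
    # Unnormalized prior weights exp(J * sum_i xi_i xi_{i+1}).
    prior_w = [math.exp(J * sum(x[i] * x[i + 1] for i in range(n - 1)))
               for x in configs]
    z_prior = sum(prior_w)
    # Each pixel of the noise channel is normalized by 2 cosh(beta_p).
    norm = (2.0 * math.cosh(beta_p)) ** n
    total = 0.0
    for x, w in zip(configs, prior_w):
        likelihood = math.exp(beta_p * sum(t * s for t, s in zip(tau, x))) / norm
        total += likelihood * w / z_prior
    return total

def estimate_hyperparameters(tau, grid):
    """Grid search for the (beta_p, J) pair that maximizes the evidence."""
    return max(((bp, J) for bp in grid for J in grid),
               key=lambda pair: evidence(tau, pair[0], pair[1]))

if __name__ == "__main__":
    tau = (1, 1, 1, -1, 1, 1)
    print(estimate_hyperparameters(tau, [0.25, 0.5, 1.0, 1.5]))
```

For images of realistic size the sum over `configs` is out of reach, which is why simulations or mean-field approximations are needed.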

A different strategy is to estimate $\sigma$ that maximizes $P(\tau|\sigma, \beta_p)P(\sigma, J)$ as
a function of $\beta_p$ and $J$ without marginalization of $\xi$ in (6.47). One denotes the
result as $\{\hat{\sigma}(\beta_p, J)\}$ and estimates $\beta_p$ and $J$ that maximize the product


$$P(\tau|\{\hat{\sigma}(\beta_p, J)\}, \beta_p)\, P(\{\hat{\sigma}(\beta_p, J)\}, J).$$


This method is called the maximum likelihood estimation.

Another idea is useful when one knows the number of neighbouring pixel pairs 
L having different grey scales (Tanaka and Morita 1995; Morita and Tanaka 1996, 
1997; Tanaka and Morita 1997; K. Tanaka 2001a) 


$$L = \sum_{\langle ij \rangle} \{1 - \delta(\xi_i, \xi_j)\}. \tag{6.48}$$


One then accepts the image nearest to the degraded image under this constraint 
(6.48). By taking account of the constraint using the Lagrange multiplier, we see 
that the problem is to find the ground state of 


$$H = -\sum_i \delta(\sigma_i, \tau_i) - J\Big[L - \sum_{\langle ij \rangle} \{1 - \delta(\sigma_i, \sigma_j)\}\Big].$$


The Potts model in random fields (6.18) has thus been derived naturally. The 
parameter J is chosen such that the solution satisfies the constraint (6.48). Figure 
6.5 is an example of restoration of a 256-level grey-scale image by this method 
of constrained optimization. 


13 Precisely speaking, the restored image (c) has been obtained by reducing the 256-level
degraded image to an eight-level image and then applying the constrained optimization method
and mean-field annealing. The result has further been refined by a method called conditional
maximization with respect to the grey scale of 256 levels.


Fig. 6.5. Restoration of 256-level image by the Potts model: (a) original image,
(b) degraded image, and (c) restored image. Courtesy of Kazuyuki Tanaka
(1999). (Copyright 1999 by the Physical Society of Japan)


Bibliographical note 


The papers by Geman and Geman (1984), Derin et al. (1984), Marroquin et 
al. (1987), and Pryce and Bruce (1995) are important original contributions on 
stochastic approaches to image restoration and, at the same time, are useful to 
obtain an overview of the field. For reviews mainly from an engineering point 
of view, see Chellappa and Jain (1993). Some recent topics using statistical- 
mechanical ideas include dynamics of restoration (Inoue and Carlucci 2000), state 
search by quantum fluctuations (Tanaka and Horiguchi 2000; Inoue 2001), hyper- 
parameter estimation in a solvable model (Tanaka and Inoue 2000), segmentation 
by the XY model (Okada et al. 1999), and the cluster variation method to
improve the naive mean-field approach (Tanaka and Morita 1995, 1996; K. Tanaka
2001b).


7 
ASSOCIATIVE MEMORY 


The scope of the theory of neural networks has been expanding rapidly, and 
statistical-mechanical techniques stemming from the theory of spin glasses have 
been playing important roles in the analysis of model systems. We summarize 
basic concepts in the present chapter and study the characteristics of networks 
with interneuron connections given by a specific prescription. The next chap- 
ter deals with the problem of learning where the connections gradually change 
according to some rules to achieve specified goals. 


7.1 Associative memory 


The states of processing units (neurons) in an associative memory change with 
time autonomously and, under certain circumstances, reach an equilibrium state 
that reflects the initial condition. We start our argument by elucidating the 
basic concepts of an associative memory, a typical neural network. Note that 
the emphasis is, in the present book, on mathematical analyses of information 
processing systems with engineering applications in mind (however remote they 
might be), rather than on understanding the functioning of the real brain. We 
nevertheless use words borrowed from neurobiology (neuron, synapse, etc.) be- 
cause of their convenience to express various basic building blocks of the theory. 


7.1.1 Model neuron 


The structure of a neuron in the real brain is schematically drawn in Fig. 7.1. A 
neuron receives inputs from other neurons through synapses, and if the weighted 
sum of the input signals exceeds a threshold, the neuron starts to emit its own 
signal. This signal is transmitted through an axon to many other neurons.


Fig. 7.1. Schematic structure of a neuron

To construct a system of information processing, it is convenient to model
the functioning of a neuron in a very simple way. We label the state of a neuron
by the variable $S_i = 1$ if the neuron is excited (transmitting a signal) and by
$S_i = -1$ when it is at rest. The synaptic efficacy from neuron $j$ to neuron $i$ will
be denoted by $2J_{ij}$. Then the sum of signals to the $i$th neuron, $h_i$, is written as


$$h_i = \sum_j J_{ij}(S_j + 1). \tag{7.1}$$


Equation (7.1) means that the input signal from $j$ to $i$ is $2J_{ij}$ when $S_j = 1$
and zero if $S_j = -1$. The synaptic efficacy $J_{ij}$ (which will often be called the
connection or interaction) can be both positive and negative. In the former case,
the signal from $S_j$ increases the value of the right hand side of (7.1) and tends to
excite neuron $i$; a positive connection is thus called an excitatory synapse. The
negative case is the inhibitory synapse.

Let us assume that the neuron $i$ becomes excited if the input signal (7.1)
exceeds a threshold $\theta_i$ at time $t$ and is not excited otherwise:


$$S_i(t + \Delta t) = \mathrm{sgn}\Big(\sum_j J_{ij}(S_j(t) + 1) - \theta_i\Big). \tag{7.2}$$


We focus our argument on the simple case where the threshold $\theta_i$ is equal to
$\sum_j J_{ij}$ so that there is no constant term in the argument on the right hand side
of the above equation:


$$S_i(t + \Delta t) = \mathrm{sgn}\Big(\sum_j J_{ij} S_j(t)\Big). \tag{7.3}$$


7.1.2 Memory and stable fixed point 


The capability of highly non-trivial information processing emerges in a neural 
network when very many neurons are connected with each other by synapses, 
and consequently the properties of connections determine the characteristics of 
the network. The first half of the present chapter (up to §7.5) discusses how
memory and its retrieval (recall) become possible under a certain rule for synaptic 
connections. 

A pattern of excitation of a neural network will be denoted by $\{\xi_i^\mu\}$. Here
$i$ $(= 1, \ldots, N)$ is the neuron index, $\mu$ $(= 1, \ldots, p)$ denotes the excitation pattern
index, and $\xi_i^\mu$ is an Ising variable ($\pm 1$). For example, if the $\mu$th pattern has the
$i$th neuron in the excited state, we write $\xi_i^\mu = 1$. The $\mu$th excitation pattern
can be written as $\{\xi_1^\mu, \xi_2^\mu, \ldots, \xi_N^\mu\}$, and $p$ such patterns are assumed to exist
($\mu = 1, 2, \ldots, p$).

Let us suppose that a specific excitation pattern of a neural network
corresponds to a memory and investigate the problem of memorization of $p$ patterns
in a network of $N$ neurons. We identify memorization of a pattern $\{\xi_i^\mu\}_{i=1,\ldots,N}$
with the fact that the pattern is a stable fixed point of the time evolution rule
(7.3), that is, $S_i(t) = \xi_i^\mu \to S_i(t + \Delta t) = \xi_i^\mu$ holds at all $i$. We investigate the
condition for this stability.

To facilitate our theoretical analysis as well as to develop arguments
independent of a specific pattern, we restrict ourselves to random patterns in which
each $\xi_i^\mu$ takes $\pm 1$ at random. For random patterns, each pattern is a stable fixed
point as long as $p$ is not too large if we choose $J_{ij}$ as follows:


$$J_{ij} = \frac{1}{N}\sum_{\mu=1}^{p} \xi_i^\mu \xi_j^\mu. \tag{7.4}$$

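The diagonal term $J_{ii}$ is assumed to be vanishing ($J_{ii} = 0$). This is called the
Hebb rule. In fact, if the state of the system is in perfect coincidence with the
pattern $\mu$ at time $t$ (i.e. $S_i(t) = \xi_i^\mu$, $\forall i$), the time evolution (7.3) gives the state
of the $i$th neuron at the next time step as


$$\mathrm{sgn}\Big(\sum_j J_{ij}\xi_j^\mu\Big) = \mathrm{sgn}\Big(\frac{1}{N}\sum_j\sum_\nu \xi_i^\nu\xi_j^\nu\xi_j^\mu\Big) = \mathrm{sgn}\Big(\sum_\nu \xi_i^\nu\delta_{\nu\mu}\Big) = \mathrm{sgn}(\xi_i^\mu), \tag{7.5}$$

where we have used the approximate orthogonality between random patterns


$$\frac{1}{N}\sum_j \xi_j^\mu\xi_j^\nu = \delta_{\nu\mu} + O\Big(\frac{1}{\sqrt{N}}\Big). \tag{7.6}$$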


Consequently we have $S_i(t + \Delta t) = \xi_i^\mu$ at all $i$ for sufficiently large $N$. Drawbacks
of this argument are, first, that we have not checked the contribution of the
$O(1/\sqrt{N})$ term in the orthogonality relation (7.6), and, second, that the stability
of the pattern is not clear when one starts from a state slightly different from
the embedded (memorized) pattern (i.e. $S_i(t) = \xi_i^\mu$ at most, but not all, $i$). These
points will be investigated in the following sections.
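
The fixed-point argument of (7.4) to (7.6) can be checked numerically. The sketch below is only an illustration under stated assumptions: all neurons are updated synchronously (the text has not yet fixed an update order), $\mathrm{sgn}(0)$ is arbitrarily taken as $+1$, and the sizes $N = 500$, $p = 5$ and the 10% noise level are arbitrary choices. The helper names are hypothetical.

```python
import numpy as np

def hebb_couplings(patterns):
    """Hebb rule (7.4): J_ij = (1/N) sum_mu xi_i^mu xi_j^mu, with J_ii = 0."""
    p, n = patterns.shape
    J = patterns.T @ patterns / n
    np.fill_diagonal(J, 0.0)
    return J

def update(J, S):
    """One synchronous application of (7.3); sgn(0) is taken as +1 here."""
    return np.where(J @ S >= 0, 1, -1)

rng = np.random.default_rng(0)
N, p = 500, 5
xi = rng.choice([-1, 1], size=(p, N))   # p random patterns of N Ising variables
J = hebb_couplings(xi)

# Every embedded pattern should be reproduced by one update step.
all_fixed = all(np.array_equal(update(J, xi[mu]), xi[mu]) for mu in range(p))

# Start from pattern 0 with 10% of the neurons flipped and iterate a few steps.
noisy = xi[0].copy()
flipped = rng.choice(N, size=N // 10, replace=False)
noisy[flipped] *= -1
S = noisy
for _ in range(5):
    S = update(J, S)

overlap_before = noisy @ xi[0] / N
overlap_after = S @ xi[0] / N
print(all_fixed, overlap_before, overlap_after)
```

With $p \ll N$ the crosstalk from the other patterns is of order $\sqrt{p/N}$, small compared with the signal, so the noise in the initial state is removed.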


7.1.3 Statistical mechanics of the random Ising model 


The time evolution described by (7.3) is equivalent to the zero-temperature dy- 
namics of the Ising model with the Hamiltonian 


$$H = -\frac{1}{2}\sum_{i,j} J_{ij} S_i S_j = -\frac{1}{2}\sum_i S_i \sum_j J_{ij} S_j. \tag{7.7}$$


The reason is that $\sum_j J_{ij} S_j$ is the local field $h_i$ to the spin $S_i$,^14 and (7.3) aligns
the spin (neuron state) $S_i$ to the direction of the local field at the next time step


14 Note that the present local field is different from (7.1) by a constant $\sum_j J_{ij}$.


Fig. 7.2. Energy landscape and time evolution


$t + \Delta t$, leading to a monotonic decrease of the energy (7.7). Figure 7.2 depicts
this situation intuitively where the network reaches a minimum of the energy 
closest to the initial condition and stops its time evolution there. Consequently, 
if the system has the property that there is a one-to-one correspondence between 
memorized patterns and energy minima, the system starting from the initial con- 
dition with a small amount of noise (i.e. an initial pattern slightly different from 
an embedded pattern) will evolve towards the closest memorized pattern, and the 
noise in the initial state will thereby be erased. The system therefore works as an 
information processor to remove noise by autonomous and distributed dynamics. 
Thus the problem is to find the conditions for this behaviour to be realized under 
the Hebb rule (7.4). 

We note here that the dynamics (7.3) definitely determines the state of
the neuron, given the input $h_i(t) = \sum_j J_{ij} S_j(t)$. However, the functioning of
real neurons may not be so deterministic, which suggests the introduction of a
stochastic process in the time evolution. For this purpose, it is convenient to
assume that $S_i(t + \Delta t)$ becomes 1 with probability $1/(1 + e^{-2\beta h_i(t)})$ and is $-1$
with probability $e^{-2\beta h_i(t)}/(1 + e^{-2\beta h_i(t)})$. Here $\beta$ is a parameter introduced to
control uncertainty in the functioning of the neuron. This stochastic dynamics
reduces to (7.3) in the limit $\beta \to \infty$ because $S_i(t + \Delta t) = 1$ if $h_i(t) > 0$ and is
$S_i(t + \Delta t) = -1$ otherwise. The network becomes perfectly random if $\beta = 0$ (see
Fig. 7.3).
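
As a small illustration of this stochastic rule, the sketch below evaluates the excitation probability and draws a new neuron state. The function names are hypothetical, and the probability is written through $\tanh$ (an equivalent form, since $1/(1 + e^{-2x}) = (1 + \tanh x)/2$) only to avoid overflow at large $\beta |h_i|$.

```python
import math
import random

def excitation_probability(h, beta):
    """P(S_i(t + dt) = 1) = 1 / (1 + exp(-2 beta h)), evaluated via tanh."""
    return 0.5 * (1.0 + math.tanh(beta * h))

def stochastic_update(h, beta, rng=random):
    """Draw the new state (+1 or -1) of a neuron with local field h."""
    return 1 if rng.random() < excitation_probability(h, beta) else -1

if __name__ == "__main__":
    # beta = 0 is perfectly random; large beta approaches the rule (7.3).
    for beta in (0.0, 1.0, 10.0):
        print(beta, excitation_probability(0.5, beta))
```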

This stochastic dynamics is equivalent to the kinetic Ising model (4.93) and 
(4.94). To confirm this, we consider, for example, the process of the second term 
on the right hand side of (4.94). Since $S'$ is a spin configuration with $S_i$ inverted
to $-S_i$ in $S = \{S_i\}$, we find $\Delta(S', S) = H(S') - H(S) = 2S_i h_i$. The transition
probability of this spin inversion process is, according to (4.94),


$$w = \frac{1}{1 + \exp(\beta\Delta(S', S))} = \frac{1}{1 + e^{2\beta S_i h_i}}. \tag{7.8}$$

If $S_i = 1$ currently, the possible new state is $S_i = -1$ and the transition
probability for such a process is $w = (1 + e^{2\beta h_i})^{-1} = e^{-2\beta h_i}/(1 + e^{-2\beta h_i})$, which


Fig. 7.3. The probability that the neuron $i$ becomes excited, $1/(1 + \exp(-2\beta h_i))$.


coincides with the above-mentioned transition probability of neuron update to
$S_i = -1$. If we insert the equilibrium Gibbs-Boltzmann distribution into the
probability distribution of (4.93), the right hand side vanishes, and the equilibrium
distribution does not change with time as expected. It is also known that, under the
kinetic Ising model, the state of a system approaches the equilibrium Gibbs-Boltzmann
distribution at an appropriate temperature $\beta^{-1} = T$ even when the initial condition is
away from equilibrium. Thus the problem of memory retrieval after a sufficiently 
long time starting from an initial condition in a neural network can be analysed 
by equilibrium statistical mechanics of the Ising model (7.7) with random in- 
teractions specified by the Hebb rule (7.4). The model Hamiltonian (7.7) with 
random patterns embedded by the Hebb rule (7.4) is usually called the Hopfield 
model (Hopfield 1982). 


7.2 Embedding a finite number of patterns 

It is relatively straightforward to analyse statistical-mechanical properties of the 
Hopfield model when the number of embedded patterns p is finite (Amit et al. 
1985). 

7.2.1 Free energy and equations of state 


The partition function of the system described by the Hamiltonian (7.7) with
the Hebb rule (7.4) is


$$Z = \mathrm{Tr}\exp\Big(\frac{\beta}{2N}\sum_\mu\Big(\sum_i S_i\xi_i^\mu\Big)^2\Big), \tag{7.9}$$


where Tr is the sum over $S$. The effect of the vanishing diagonal ($J_{ii} = 0$) has
been ignored here since it is of lower order in $N$. By introducing a new integration
variable $m^\mu$ to linearize the square in the exponent, we have


$$Z = \mathrm{Tr}\int\prod_{\mu=1}^{p} dm^\mu \exp\Big\{-\frac{1}{2}N\beta\sum_\mu (m^\mu)^2 + \beta\sum_\mu m^\mu\sum_i S_i\xi_i^\mu\Big\} = \int\prod_\mu dm^\mu \exp\Big\{-\frac{1}{2}N\beta\boldsymbol{m}^2 + \sum_i \log(2\cosh\beta\boldsymbol{m}\cdot\boldsymbol{\xi}_i)\Big\}, \tag{7.10}$$
aa 7 


where $\boldsymbol{m} = {}^t(m^1, \ldots, m^p)$, $\boldsymbol{\xi}_i = {}^t(\xi_i^1, \ldots, \xi_i^p)$, and the overall multiplicative
constant has been omitted as it does not affect the physical properties.

In consideration of the large number of neurons in the brain (about $10^{11}$) and
also from the viewpoint of designing computational equipment with highly
non-trivial information processing capabilities, it makes sense to consider the limit
of large-size systems. It is also interesting from a statistical-mechanical point of
view to discuss phase transitions that appear only in the thermodynamic limit
$N \to \infty$. Hence we take the limit $N \to \infty$ in (7.10), so that the integral is
evaluated by steepest descent. The free energy is


$$f = \frac{1}{2}\boldsymbol{m}^2 - \frac{T}{N}\sum_i \log(2\cosh\beta\boldsymbol{m}\cdot\boldsymbol{\xi}_i). \tag{7.11}$$


The extremization condition of the free energy (7.11) gives the equation of state


$$\boldsymbol{m} = \frac{1}{N}\sum_i \boldsymbol{\xi}_i\tanh(\beta\boldsymbol{m}\cdot\boldsymbol{\xi}_i). \tag{7.12}$$


In the limit of large $N$, the sum over $i$ in (7.12) becomes equivalent to the average
over the random components of the vector $\boldsymbol{\xi} = {}^t(\xi^1, \ldots, \xi^p)$ ($\xi^1 = \pm 1, \ldots, \xi^p = \pm 1$)
according to the self-averaging property. This averaging corresponds to the
configurational average in the theory of spin glasses, which we denote by $[\cdots]$,
and we write the free energy and the equation of state as


$$f = \frac{1}{2}\boldsymbol{m}^2 - T[\log(2\cosh\beta\boldsymbol{m}\cdot\boldsymbol{\xi})], \tag{7.13}$$
$$\boldsymbol{m} = [\boldsymbol{\xi}\tanh\beta\boldsymbol{m}\cdot\boldsymbol{\xi}]. \tag{7.14}$$


The physical significance of the order parameter $\boldsymbol{m}$ is revealed by the
saddle-point condition of the first expression of (7.10) for large $N$:


$$m^\mu = \frac{1}{N}\sum_i S_i\xi_i^\mu. \tag{7.15}$$


This equation shows that $m^\mu$ is the overlap between the $\mu$th embedded pattern
and the state of the system. If the state of the system is in perfect coincidence
with the $\mu$th pattern ($S_i = \xi_i^\mu$, $\forall i$), we have $m^\mu = 1$. In the total absence of
correlation, on the other hand, $S_i$ assumes $\pm 1$ independently of $\xi_i^\mu$, and
consequently $m^\mu = 0$. It should then be clear that the success of retrieval of the $\mu$th
pattern is measured by the order parameter $m^\mu$.


7.2.2 Solution of the equation of state 


Let us proceed to the solution of the equation of state (7.14) and discuss the 
macroscopic properties of the system. We first restrict ourselves to the case of a


Fig. 7.4. Free energy of the Hopfield model when a single pattern is retrieved.


single-pattern retrieval. There is no loss of generality by assuming that the first
pattern is to be retrieved: $m^1 = m$, $m^2 = \cdots = m^p = 0$. Then the equation of
state (7.14) is

$$m = [\xi^1\tanh(\beta m\xi^1)] = \tanh\beta m. \tag{7.16}$$


This is precisely the mean-field equation of the usual ferromagnetic Ising model,
(1.19) with $h = 0$, which has a stable non-trivial solution $m \neq 0$ for $T = \beta^{-1} < 1$.
The free energy, accordingly, has minima at $m \neq 0$ if $T < 1$ as shown in Fig. 7.4. It
is seen in this figure that an initial condition away from a stable state will evolve 
towards the nearest stable state with gradually decreasing free energy under 
the dynamics of §7.1.3 if the uncertainty parameter of the neuron functioning T 
(temperature in the physics terms) is smaller than 1. In particular, if T = 0, then 
the stable state is m = +1 from (7.16), and a perfect retrieval of the embedded 
pattern (or its complete reversal) is achieved. It has thus been shown that the 
Hopfield model, if the number of embedded patterns is finite and the temperature 
is not very large, works as an associative memory that retrieves the appropriate 
embedded pattern when a noisy version of the pattern is given as the initial 
condition. 
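
The single-pattern equation of state (7.16) is easily solved by fixed-point iteration. The sketch below is a minimal illustration with a hypothetical function name and arbitrary tolerances; it assumes $T > 0$ and starts from $m = 1$ so that the positive solution is selected when it exists.

```python
import math

def solve_overlap(T, m0=1.0, tol=1e-12, max_iter=100000):
    """Iterate m -> tanh(m / T) until successive values differ by less than tol."""
    beta = 1.0 / T
    m = m0
    for _ in range(max_iter):
        m_new = math.tanh(beta * m)
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m

if __name__ == "__main__":
    # Non-trivial overlap below T = 1, vanishing overlap above it.
    for T in (0.2, 0.5, 0.9, 1.5):
        print(T, solve_overlap(T))
```

Very close to $T = 1$ the iteration converges slowly, so the iteration cap may be reached before the tolerance is met.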

The equation of state (7.14) has many other solutions. A simple example is
the solution to retrieve $l$ patterns simultaneously with the same amplitude


$$\boldsymbol{m} = (m_l, m_l, m_l, \ldots, m_l, 0, \ldots, 0). \tag{7.17}$$


It can be shown that the ground state ($T = 0$) energy $E_l$ of this solution satisfies
the following inequality:

$$E_1 < E_3 < E_5 < \cdots. \tag{7.18}$$


The single-retrieval solution is the most stable one, and other odd-pattern
retrieval cases follow. All even-pattern retrievals are unstable. At finite
temperatures, the retrieval solution with $l = 1$ exists stably below the critical
temperature $T_c = 1$ as has been shown already. This is the unique solution in the range
$0.461 < T < 1$. The $l = 3$ solution appears at $T = 0.461$. Other solutions with
$l = 5, 7, \ldots$ show up one after another as the temperature is further decreased.
Solutions with non-uniform components in contrast to (7.17) also exist at low
temperatures.


7.3 Many patterns embedded 


We next investigate the case where the number of patterns p is proportional to 
N (Amit et al. 1987). As was mentioned at the end of the previous section, the 
equation of state has various solutions for finite p. For larger p, more and more 
complicated solutions appear, and when p reaches O(N), a spin glass solution 
emerges, in which the state of the system is randomly frozen in the sense that the 
state has no correlation with embedded patterns. If the ratio $\alpha = p/N$ exceeds a
threshold, the spin glass state becomes the only stable state at low temperatures. 
This section is devoted to a detailed account of this scenario. 


7.3.1 Replicated partition function 

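It is necessary to calculate the configurational average of the $n$th power of the
partition function to derive the free energy $F = -T[\log Z]$ averaged over the
pattern randomness $\{\xi_i^\mu\}$. It will be assumed for simplicity that only the first
pattern ($\mu = 1$) is to be retrieved. The configurational average of the replicated
partition function is, by (7.4) and (7.7),


$$[Z^n] = \Big[\mathrm{Tr}\exp\Big(\frac{\beta}{2N}\sum_{i,j}\sum_\mu\sum_{\rho=1}^{n}\xi_i^\mu\xi_j^\mu S_i^\rho S_j^\rho\Big)\Big] = \mathrm{Tr}\int\prod_{\mu\rho} dm_\rho^\mu\Big[\exp\beta N\Big\{-\frac{1}{2}\sum_{\mu\ge 2}\sum_\rho (m_\rho^\mu)^2 + \frac{1}{N}\sum_{\mu\ge 2}\sum_\rho m_\rho^\mu\sum_i \xi_i^\mu S_i^\rho - \frac{1}{2}\sum_\rho (m_\rho^1)^2 + \frac{1}{N}\sum_\rho m_\rho^1\sum_i \xi_i^1 S_i^\rho\Big\}\Big]. \tag{7.19}$$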


It is convenient to separate the contribution of the first pattern from the rest. 


7.3.2 Non-retrieved patterns 


The overlap between the state of the system and a pattern other than the first
one ($\mu \ge 2$) is due only to coincidental contributions from the randomness of
$\{\xi_i^\mu\}$, and is


$$m_\rho^\mu = \frac{1}{N}\sum_i \xi_i^\mu S_i^\rho \approx O\Big(\frac{1}{\sqrt{N}}\Big). \tag{7.20}$$


The reason is that, if $\xi_i^\mu$ and $S_i^\rho$ assume 1 or $-1$ randomly and independently, the
stochastic variable $\sum_i \xi_i^\mu S_i^\rho$ has average 0 and variance $N$, since
$(\sum_i \xi_i^\mu S_i^\rho)^2 = N + \sum_{i\neq j}\xi_i^\mu\xi_j^\mu S_i^\rho S_j^\rho \approx N$. In the case of finite $p$ treated in the previous section,
the number of terms with $\mu \ge 2$ in (7.19) is finite, and we could ignore these
contributions in the limit of large $N$. Since the number of such terms is proportional
to $N$ in the present section, a more careful treatment is needed.

The first step is the rescaling of the variable $m_\rho^\mu \to m_\rho^\mu/\sqrt{\beta N}$ to reduce $m_\rho^\mu$ to
$O(1)$. We then perform configurational averaging $[\cdots]$ for the terms with $\mu \ge 2$
in (7.19) to find


$$\exp\Big\{-\frac{1}{2}\sum_{\mu\rho}(m_\rho^\mu)^2 + \sum_{i\mu}\log\cosh\Big(\sqrt{\frac{\beta}{N}}\sum_\rho m_\rho^\mu S_i^\rho\Big)\Big\}. \tag{7.21}$$


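The sum over $\mu$ is for the range $\mu \ge 2$. In the limit of large $N$, we may expand
$\log\cosh(\cdot)$ and keep only the leading term:


$$\exp\Big(-\frac{1}{2}\sum_\mu\sum_{\rho\sigma} m_\rho^\mu K_{\rho\sigma} m_\sigma^\mu\Big), \tag{7.22}$$

where


$$K_{\rho\sigma} = \delta_{\rho\sigma} - \frac{\beta}{N}\sum_i S_i^\rho S_i^\sigma. \tag{7.23}$$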


The term (7.22) is quadratic in $m_\rho^\mu$, and the integral over $m_\rho^\mu$ can be carried
out by the multivariable Gaussian integral. The result is, up to a trivial overall
constant,


$$(\det K)^{-(p-1)/2} = \exp\Big(-\frac{p-1}{2}\mathrm{Tr}\log K\Big) = \int\prod_{(\rho\sigma)} dq_{\rho\sigma}\,\delta\Big(q_{\rho\sigma} - \frac{1}{N}\sum_i S_i^\rho S_i^\sigma\Big)\exp\Big\{-\frac{p-1}{2}\mathrm{Tr}_n\log\{(1-\beta)I - \beta Q\}\Big\}. \tag{7.24}$$


Here $(\rho\sigma)$ is the set of $n(n-1)$ replica pairs. We have also used the fact that
the diagonal element $K_{\rho\rho}$ of the matrix $K$ is equal to $1 - \beta$ and the off-diagonal
part is


$$K_{\rho\sigma} = -\frac{\beta}{N}\sum_i S_i^\rho S_i^\sigma = -\beta q_{\rho\sigma}. \tag{7.25}$$

The expression $Q$ is a matrix with the off-diagonal elements $q_{\rho\sigma}$ and 0 along the
diagonal. The operation $\mathrm{Tr}_n$ in the exponent is the trace of the $n \times n$ matrix.

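Fourier representation of the delta function in (7.24) with integral variable $r_{\rho\sigma}$
and insertion of the result into (7.19) yield


$$[Z^n] = \mathrm{Tr}\int\prod_\rho dm_\rho^1\int\prod_{(\rho\sigma)} dq_{\rho\sigma}\,dr_{\rho\sigma}\exp\Big\{-\frac{N}{2}\alpha\beta^2\sum_{(\rho\sigma)} r_{\rho\sigma}q_{\rho\sigma} + \frac{\alpha\beta^2}{2}\sum_{i,(\rho\sigma)} r_{\rho\sigma}S_i^\rho S_i^\sigma\Big\} \cdot \exp\Big\{-\frac{p-1}{2}\mathrm{Tr}_n\log\{(1-\beta)I - \beta Q\}\Big\} \cdot \Big[\exp\beta N\Big\{-\frac{1}{2}\sum_\rho (m_\rho^1)^2 + \frac{1}{N}\sum_\rho m_\rho^1\sum_i \xi_i^1 S_i^\rho\Big\}\Big]. \tag{7.26}$$


7.3.3 Free energy and order parameter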


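The $S_i$-dependent parts in (7.26) are written as


$$\mathrm{Tr}\exp\Big(\beta\sum_{i,\rho} m_\rho^1\xi_i^1 S_i^\rho + \frac{1}{2}\alpha\beta^2\sum_{i,(\rho\sigma)} r_{\rho\sigma}S_i^\rho S_i^\sigma\Big) = \exp\Big\{\sum_i \log\mathrm{Tr}\exp\Big(\beta\sum_\rho m_\rho^1\xi_i^1 S^\rho + \frac{1}{2}\alpha\beta^2\sum_{(\rho\sigma)} r_{\rho\sigma}S^\rho S^\sigma\Big)\Big\} = \exp N\Big[\log\mathrm{Tr}\exp\Big(\beta\sum_\rho m_\rho^1\xi^1 S^\rho + \frac{1}{2}\alpha\beta^2\sum_{(\rho\sigma)} r_{\rho\sigma}S^\rho S^\sigma\Big)\Big]. \tag{7.27}$$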
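In going from the second expression to the last one, we have used the fact that
the sum over $i$ is equivalent to the configurational average $[\cdots]$ by self-averaging
in the limit of large $N$. Using this result in (7.26), we obtain


$$[Z^n] = \int\prod_\rho dm_\rho^1\int\prod_{(\rho\sigma)} dr_{\rho\sigma}\prod_{(\rho\sigma)} dq_{\rho\sigma} \cdot \exp N\Big\{-\frac{\beta}{2}\sum_\rho (m_\rho^1)^2 - \frac{\alpha}{2}\mathrm{Tr}_n\log\{(1-\beta)I - \beta Q\} - \frac{\alpha\beta^2}{2}\sum_{(\rho\sigma)} r_{\rho\sigma}q_{\rho\sigma} + \Big[\log\mathrm{Tr}\exp\Big(\frac{1}{2}\alpha\beta^2\sum_{(\rho\sigma)} r_{\rho\sigma}S^\rho S^\sigma + \beta\sum_\rho m_\rho^1\xi^1 S^\rho\Big)\Big]\Big\}, \tag{7.28}$$


where we have set $(p-1)/N = \alpha$ assuming $N, p \gg 1$. In the thermodynamic
limit $N \to \infty$, the free energy is derived from the term proportional to $n\beta$ in the
exponent:


$$f = \frac{1}{2n}\sum_\rho (m_\rho^1)^2 + \frac{\alpha}{2n\beta}\mathrm{Tr}_n\log\{(1-\beta)I - \beta Q\} + \frac{\alpha\beta}{2n}\sum_{(\rho\sigma)} r_{\rho\sigma}q_{\rho\sigma} - \frac{1}{\beta n}\big[\log\mathrm{Tr}\,e^{\beta H_\xi}\big], \tag{7.29}$$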
(pF) 


where $\beta H_\xi$ is the quantity in the exponent after the operator log Tr in (7.28).
The order parameters have the following significance. The parameter $m_\rho^\mu$ is
the overlap between the state of the system and the $\mu$th pattern. This can be
confirmed by extremizing the exponent of (7.19) with respect to $m_\rho^\mu$ to find

$$m_\rho^\mu = \frac{1}{N}\sum_i \xi_i^\mu S_i^\rho. \tag{7.30}$$


Next, $q_{\rho\sigma}$ is the spin glass order parameter from comparison of (7.24) and (2.20).
As for $r_{\rho\sigma}$, extremization of the exponent of (7.26) with respect to $q_{\rho\sigma}$ gives

$$r_{\rho\sigma} = \frac{1}{\alpha}\sum_{\mu\ge 2} m_\rho^\mu m_\sigma^\mu. \tag{7.31}$$


To calculate the variation with respect to the components of $Q$ in the $\mathrm{Tr}_n\log$
term of (7.26), we have used the facts that this term has been derived from the
Gaussian integral over $m_\rho^\mu$ of (7.22) with $K_{\rho\sigma}$ replaced by $(1-\beta)I - \beta Q$ and that
we have performed scaling of $m_\rho^\mu$ by $\sqrt{\beta N}$ just before (7.22). From (7.31), $r_{\rho\sigma}$ is
understood as the sum of effects of non-retrieved patterns.


7.3.4 Replica-symmetric solution 


The assumption of replica symmetry leads to an explicit form of the free energy
(7.29). From $m_\rho^1 = m$, $q_{\rho\sigma} = q$, and $r_{\rho\sigma} = r$ for $\rho \neq \sigma$, the first term of (7.29) is
$m^2/2$ and the third term in the limit $n \to 0$ is $-\alpha\beta rq/2$. To calculate the second
term, it should be noted that the eigenvectors of the matrix $(1-\beta)I - \beta Q$ are,
first, the uniform one ${}^t(1, 1, \ldots, 1)$ and, second, the sequence of the $n$th roots of
unity ${}^t(e^{2\pi ik/n}, e^{4\pi ik/n}, \ldots, e^{2n\pi ik/n})$ ($k = 1, 2, \ldots, n-1$). The eigenvalue of
the first eigenvector is $1 - \beta + \beta q - n\beta q$ without degeneracy, and for the second
eigenvector, it is $1 - \beta + \beta q$ (degeneracy $n - 1$). Thus in the limit $n \to 0$,


$$\frac{1}{n}\mathrm{Tr}_n\log\{(1-\beta)I - \beta Q\} = \frac{1}{n}\log(1 - \beta + \beta q - n\beta q) + \frac{n-1}{n}\log(1 - \beta + \beta q) \to \log(1 - \beta + \beta q) - \frac{\beta q}{1 - \beta + \beta q}. \tag{7.32}$$
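
The eigenvalue structure used in this step can be confirmed numerically for an integer number of replicas. The following sketch uses arbitrary illustrative values ($n = 5$, $\beta = 0.5$, $q = 0.3$) and builds the replica-symmetric matrix $(1-\beta)I - \beta Q$ explicitly; the variable names are chosen here for illustration only.

```python
import numpy as np

n, beta, q = 5, 0.5, 0.3

# Replica-symmetric Q: q on every off-diagonal element, 0 on the diagonal.
Q = q * (np.ones((n, n)) - np.eye(n))
M = (1.0 - beta) * np.eye(n) - beta * Q

# M is real symmetric, so eigvalsh applies; sort for easy comparison.
eigenvalues = np.sort(np.linalg.eigvalsh(M))

uniform = 1.0 - beta + beta * q - n * beta * q   # non-degenerate eigenvalue
degenerate = 1.0 - beta + beta * q               # (n - 1)-fold degenerate eigenvalue

print(eigenvalues, uniform, degenerate)
```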


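The final term of (7.29) is evaluated as follows, apart from its overall minus
sign and with $\xi$ standing for $\xi^1$. As $n \to 0$,

$$\frac{1}{\beta n}\Big[\log\mathrm{Tr}\exp\Big(\frac{1}{2}\alpha\beta^2 r\Big(\sum_\rho S^\rho\Big)^2 - \frac{1}{2}\alpha\beta^2 rn + \beta\sum_\rho m\xi S^\rho\Big)\Big] = -\frac{1}{2}\alpha r\beta + \frac{1}{\beta n}\Big[\log\mathrm{Tr}\int Dz\exp\Big(\beta\sqrt{\alpha r}\,z\sum_\rho S^\rho + \beta\sum_\rho m\xi S^\rho\Big)\Big] \to -\frac{1}{2}\alpha r\beta + T\Big[\int Dz\log 2\cosh\beta(\sqrt{\alpha r}\,z + m\xi)\Big]. \tag{7.33}$$


The integral over $z$ has been introduced to reduce the quadratic form $(\sum_\rho S^\rho)^2$
to a linear expression. Collecting everything together, we find the RS free energy


$$f = \frac{1}{2}m^2 + \frac{\alpha}{2\beta}\Big(\log(1 - \beta + \beta q) - \frac{\beta q}{1 - \beta + \beta q}\Big) + \frac{\alpha\beta}{2}r(1 - q) - T\int Dz\,[\log 2\cosh\beta(\sqrt{\alpha r}\,z + m\xi)]. \tag{7.34}$$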


The equations of state are derived by extremization of (7.34). The equation
for $m$ is

$$m = \int Dz\,[\xi\tanh\beta(\sqrt{\alpha r}\,z + m\xi)] = \int Dz\tanh\beta(\sqrt{\alpha r}\,z + m). \tag{7.35}$$

Next, a slight manipulation of the extremization condition with respect to $r$ gives


$$q = \int Dz\,[\tanh^2\beta(\sqrt{\alpha r}\,z + m\xi)] = \int Dz\tanh^2\beta(\sqrt{\alpha r}\,z + m), \tag{7.36}$$


and from extremization by $q$,


$$r = \frac{q}{(1 - \beta + \beta q)^2}. \tag{7.37}$$


It is necessary to solve these three equations simultaneously.

It is easy to check that there is a paramagnetic solution $m = q = r = 0$.
At low temperatures, the spin glass solution $m = 0$, $q > 0$, $r > 0$ exists. The
critical temperature for the spin glass solution to appear is $T = 1 + \sqrt{\alpha}$, as can
be verified by combining (7.37) and the leading term in the expansion of the
right hand side of (7.36) with $m = 0$.

The retrieval solution $m > 0$, $q > 0$, $r > 0$ appears discontinuously by a
first-order transition. Numerical solution of the equations of state (7.35), (7.36), and
(7.37) should be employed to draw phase boundaries.^15

The final phase diagram is shown in Fig. 7.5. For $\alpha$ smaller than 0.138,
three phases (paramagnetic, spin glass, and metastable retrieval phases) appear
in this order as the temperature is decreased. If $\alpha < 0.05$, the retrieval phase
is stable (i.e. it is the global, not local, minimum of the free energy) at low
temperatures. The RS solution of the retrieval phase is unstable for RSB at very
low temperatures. However, the RSB region is very small and the qualitative
behaviour of the order parameter $m$ and related quantities is expected to be
relatively well described by the RS solution.
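
A numerical solution of the three equations of state can be sketched as follows. This is an illustration under stated assumptions rather than the procedure behind Fig. 7.5: the Gaussian measure $Dz$ is approximated by Gauss-Hermite quadrature with 80 nodes, the equations are iterated naively from $m = q = r = 1$ for a fixed number of steps, and the function name `solve_rs` is hypothetical. Near phase boundaries such an iteration may converge slowly or land on a metastable branch.

```python
import numpy as np

# Nodes and weights for the weight function exp(-z^2 / 2); normalizing the
# weights turns the quadrature into an average over Dz.
_z, _w = np.polynomial.hermite_e.hermegauss(80)
_w = _w / _w.sum()

def solve_rs(alpha, T, n_iter=2000):
    """Iterate (7.35), (7.36) and (7.37) starting from the retrieval side."""
    beta = 1.0 / T
    m, q, r = 1.0, 1.0, 1.0
    for _ in range(n_iter):
        t = np.tanh(beta * (np.sqrt(alpha * r) * _z + m))
        m = float(_w @ t)                        # equation (7.35)
        q = float(_w @ t**2)                     # equation (7.36)
        r = q / (1.0 - beta + beta * q) ** 2     # equation (7.37)
    return m, q, r

if __name__ == "__main__":
    print(solve_rs(0.02, 0.3))   # low temperature and small alpha
    print(solve_rs(0.05, 1.5))   # above T = 1 + sqrt(alpha)
```

Starting instead from $m = 0$ with $q > 0$ keeps the iteration on the $m = 0$ branch, which is one way to look for the spin glass solution.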


7.4 Self-consistent signal-to-noise analysis 


The replica method is powerful but is not easily applicable to some problems. For 
instance, when the input-output relation of a neuron, which is often called the 
activation function, is not represented by a simple monotonic function like (7.3), 
it is difficult (and often impossible) to use the replica method. Self-consistent 
signal-to-noise analysis (SCSNA) is a convenient approximation applicable to 
many of such cases (Shiino and Fukai 1993). 


7.4.1 Stationary state of an analogue neuron 


The idea of SCSNA is most easily formulated for analogue neurons. The mem- 
brane potential of a real neuron is an analogue quantity and its change with time 
is modelled by the following equation: 


$$\frac{dh_i}{dt} = -h_i + \sum_j J_{ij}F(h_j). \tag{7.38}$$


15 Analytical analysis is possible in the limits $\alpha \to 0$ and $T \to 0$.


Fig. 7.5. Phase diagram of the Hopfield model. The dashed line near the $\alpha$ axis
is the AT line and marks the instability of the RS solution.


The first term on the right hand side is for a natural decay of $h_i$, and the second
term is the input signal from other neurons. $F(h)$ is the activation function of
an analogue neuron, $S_j = F(h_j)$ in the notation of (7.3). The variable $S_j$ has
continuous values, and the connection $J_{ij}$ is assumed to be given by the Hebb
rule (7.4) with random patterns $\xi_i^\mu = \pm 1$. The self-interaction $J_{ii}$ is vanishing.
In the stationary state, we have from (7.38)


$$h_i = \sum_j J_{ij}S_j. \tag{7.39}$$
J 
Let us look for a solution which has the overlap

$$m^\mu = \frac{1}{N}\sum_j \xi_j^\mu S_j \tag{7.40}$$


of order $m^1 = m = O(1)$ and $m^\mu = O(1/\sqrt{N})$ ($\mu \ge 2$) as in §§7.2 and 7.3.


7.4.2 Separation of signal and noise 

To evaluate the overlap (7.40), we first insert the Hebb rule (7.4) into the
definition of $h_i$, (7.39), to find the following expression in which the first and the $\mu$th
patterns ($\mu \ge 2$) are treated separately:


$$h_i = \xi_i^1 m + \xi_i^\mu m^\mu + \sum_{\nu\neq 1,\mu}\xi_i^\nu m^\nu - \alpha S_i. \tag{7.41}$$

The last term $-\alpha S_i$ is the correction due to $J_{ii} = 0$. The possibility that a term
proportional to $S_i$ may be included in the third term of (7.41) is taken into
account by separating this into a term proportional to $S_i$ and the rest,

$$\sum_{\nu\neq 1,\mu}\xi_i^\nu m^\nu = \gamma S_i + z_{i\mu}. \tag{7.42}$$

From (7.41) and (7.42),

$$S_i = F(h_i) = F(\xi_i^1 m + \xi_i^\mu m^\mu + z_{i\mu} + \Gamma S_i), \tag{7.43}$$


where we have written $\Gamma = \gamma - \alpha$. Now, suppose that the following solution has
been obtained for $S_i$ from (7.43):

$$S_i = \tilde{F}(\xi_i^1 m + \xi_i^\mu m^\mu + z_{i\mu}). \tag{7.44}$$

Then the overlap for $\mu \ge 2$ is

$$m^\mu = \frac{1}{N}\sum_j \xi_j^\mu\tilde{F}(\xi_j^1 m + z_{j\mu}) + \frac{m^\mu}{N}\sum_j \tilde{F}'(\xi_j^1 m + z_{j\mu}), \tag{7.45}$$


where we have expanded the expression to first order in $m^\mu = O(1/\sqrt{N})$. The
solution of (7.45) for $m^\mu$ is


$$m^\mu = \frac{1}{KN}\sum_j \xi_j^\mu\tilde{F}(\xi_j^1 m + z_{j\mu}), \quad K = 1 - \frac{1}{N}\sum_j \tilde{F}'(\xi_j^1 m + z_{j\mu}). \tag{7.46}$$


Here we have dropped the $\mu$-dependence of $K$ in the second relation defining $K$,
since $z_{j\mu}$ will later be assumed to be a Gaussian random variable with vanishing
mean and $\mu$-independent variance. Replacement of this equation in the left hand
side of (7.42) and separation of the result into the $(j = i)$-term and the rest give


$$\sum_{\nu\neq 1,\mu}\xi_i^\nu m^\nu = \frac{1}{KN}\sum_{\nu\neq 1,\mu}\tilde{F}(\xi_i^1 m + z_{i\nu}) + \frac{1}{KN}\sum_{j\neq i}\sum_{\nu\neq 1,\mu}\xi_i^\nu\xi_j^\nu\tilde{F}(\xi_j^1 m + z_{j\nu}). \tag{7.47}$$


Comparison of this equation with the right hand side of (7.42) reveals that the
first term on the right hand side of (7.47) is $\gamma S_i$ and the second $z_{i\mu}$. Indeed,
the first term is $(p/KN)S_i = \alpha S_i/K$ according to (7.44). Small corrections of
$O(1/\sqrt{N})$ can be ignored in this correspondence. From $\alpha S_i/K = \gamma S_i$, we find
$\gamma = \alpha/K$.

The basic assumption of the SCSNA is that the second term on the right hand
side of (7.47) is a Gaussian random variable $z_{i\mu}$ with vanishing mean. Under the
assumption that various terms in the second term on the right hand side of (7.47)
are independent of each other, the variance is written as


$$\sigma^2 = \langle z_{i\mu}^2\rangle = \frac{1}{K^2N^2}\sum_{j\neq i}\sum_{\nu\neq 1,\mu}\langle\tilde{F}(\xi_j^1 m + z_{j\nu})^2\rangle = \frac{\alpha}{K^2}\langle\tilde{F}(\xi m + z)^2\rangle_{\xi,z}. \tag{7.48}$$


Here $\langle\cdots\rangle_{\xi,z}$ is the average by the Gaussian variable $z$ and the random variable
$\xi$ ($= \pm 1$). The above equation holds because in the large-$N$ limit the sum over $j$
(and division by $N$) is considered to be equivalent to the average over $\xi$ and $z$.
A similar manipulation to rewrite (7.46) yields

$$K = 1 - \langle\tilde{F}'(\xi m + z)\rangle_{\xi,z}. \tag{7.49}$$

Finally, $m$ is seen to satisfy

$$m = \langle\xi\tilde{F}(\xi m + z)\rangle_{\xi,z}. \tag{7.50}$$

Solution of (7.48), (7.49), and (7.50) determines the properties of the system.


7.4.3 Equation of state 


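Let us rewrite (7.48), (7.49), and (7.50) in a more compact form. Equation (7.41)
is expressed as a relation between random variables according to (7.42):


$$h = \xi m + z + \Gamma Y(\xi, z) \quad \Big(\Gamma = \frac{\alpha}{K} - \alpha\Big), \tag{7.51}$$

where $Y(\xi, z) = F(\xi m + z + \Gamma Y(\xi, z))$. Introducing new symbols

$$q = \frac{K^2\sigma^2}{\alpha}, \quad \sqrt{\alpha r} = \sigma, \quad U = 1 - K, \quad x\sigma = z, \tag{7.52}$$


we have the following equations in place of (7.50), (7.48), and (7.49)

$$m = \Big\langle\int Dx\,\xi\,Y(\xi, x)\Big\rangle_\xi, \tag{7.53}$$
$$q = \Big\langle\int Dx\,Y(\xi, x)^2\Big\rangle_\xi, \tag{7.54}$$
$$U\sqrt{\alpha r} = \Big\langle\int Dx\,x\,Y(\xi, x)\Big\rangle_\xi, \tag{7.55}$$


and auxiliary relations


$$Y(\xi, x) = F(\xi m + \sqrt{\alpha r}\,x + \Gamma Y(\xi, x)), \quad \Gamma = \frac{\alpha U}{1 - U}, \quad q = (1 - U)^2 r. \tag{7.56}$$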

74.4 Binary neuron 
Let us exemplify the idea of the SCSNA using the case of a binary neuron 
F(a) = sgn(a). For an odd function F(a), it is seen from (7.56) that Y(~—1, x) = 


m,g,U,/ar and replace the integrands by their values at € = 1. 
The equation for Y, (7.56), reads 
Y(z) = sgn(m+ J/orz+TY(z)). (7.57) 


The stable solution of this equation is, according to Fig. 7.6, $Y(x) = 1$ when
$\sqrt{\alpha r}\,x + m > 0$ and $Y(x) = -1$ when $\sqrt{\alpha r}\,x + m < 0$. When two solutions exist
simultaneously as in Fig. 7.6, one accepts the one with a larger area between
$\sqrt{\alpha r}\,x + m$ and the $Y$ axis, corresponding to the Maxwell rule of phase selection.
From this and (7.54), we have $q = 1$, and from (7.53) and (7.55),

$$ m = 2\int_0^{m/\sqrt{\alpha r}} Dx, \quad \sqrt{\alpha r} = \sqrt{\alpha} + \sqrt{\frac{2}{\pi}}\, e^{-m^2/2\alpha r}. \tag{7.58} $$

Fig. 7.6. Solutions of the equation for Y

The solution obtained this way agrees with the limit $T \to 0$ of the RS solution
(7.35)-(7.37). This is trivial for $m$ and $q$. As for $r$, (7.56) coincides with (7.37)
using the correspondence $U = \beta(1 - q)$.
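The binary-neuron equations for $m$ and $\sqrt{\alpha r}$ can be explored numerically. The following is a minimal sketch, assuming a plain fixed-point iteration started from $m = 1$ and $\sqrt{\alpha r} = \sqrt{\alpha}$; the function name `solve_binary_scsna`, the starting values, and the iteration count are illustrative choices and are not taken from the text. It uses $2\int_0^a Dx = \mathrm{erf}(a/\sqrt{2})$.

```python
import math

def solve_binary_scsna(alpha, m0=1.0, iterations=2000):
    """Fixed-point iteration for m and sigma = sqrt(alpha * r) of the binary neuron.

    Assumed scheme (not from the text): update both unknowns simultaneously,
    starting from m = m0 and sigma = sqrt(alpha).
    """
    m = m0
    sigma = math.sqrt(alpha)
    for _ in range(iterations):
        # sqrt(alpha r) = sqrt(alpha) + sqrt(2/pi) * exp(-m^2 / (2 alpha r))
        new_sigma = math.sqrt(alpha) + math.sqrt(2.0 / math.pi) * math.exp(
            -m * m / (2.0 * sigma * sigma))
        # m = 2 * int_0^{m/sigma} Dx = erf(m / (sqrt(2) * sigma))
        new_m = math.erf(m / (math.sqrt(2.0) * sigma))
        m, sigma = new_m, new_sigma
    return m, sigma

if __name__ == "__main__":
    for alpha in (0.05, 0.10, 0.20):
        m, sigma = solve_binary_scsna(alpha)
        print(f"alpha = {alpha:.2f}: m = {m:.4f}, sqrt(alpha r) = {sigma:.4f}")
```

For small $\alpha$ the iteration settles at an overlap close to unity, while for $\alpha$ well above the retrieval limit it decays to $m = 0$. Locating the boundary precisely would require a more careful scan than this sketch provides.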

The SCSNA generally gives the same answer as the RS solution when the 
latter is known. The SCSNA is applicable also to many problems for which the 
replica method is not easily implemented. A future problem is to clarify the 
limit of applicability of the SCSNA, corresponding to the AT line in the replica 
method. 


7.5 Dynamics 


It is an interesting problem how the overlap $m_\mu$ changes with time in neural
networks. An initial condition close to an embedded pattern $\mu$ would develop
towards a large value of $m_\mu$ at $T = 0$. It is necessary to introduce a theoretical
framework beyond equilibrium theory to clarify the details of these time-dependent
phenomena.

It is not difficult to construct a dynamical theory when the number of em- 
bedded patterns p is finite (Coolen and Ruijgrok 1988; Riedel et al. 1988; Shiino 
et al. 1989). However, for p proportional to N, the problem is very complicated 
and there is no closed exact theory to describe the dynamics of the macroscopic 
behaviour of a network. This aspect is closely related to the existence of a compli- 
cated structure of the phase space of the spin glass state as mentioned in §7.3; it 
is highly non-trivial to describe the system properties rigorously when the state 
of the system evolves in a phase space with infinitely many free energy minima. 
The present section gives a flavour of approximation theory using the example 
of the dynamical theory due to Amari and Maginu (1988).

7.5.1 Synchronous dynamics

Let us denote the state of neuron $i$ at time $t$ by $S_i^t$ $(= \pm 1)$. We assume in the
present section that $t$ is an integer and consider synchronous dynamics in which
all neurons update their states according to

$$ S_i^{t+1} = \mathrm{sgn}\Bigl(\sum_j J_{ij} S_j^t\Bigr) \tag{7.59} $$

simultaneously at each discrete time step $t$ $(= 0, 1, 2, \ldots)$. In other words, we
apply (7.59) to all $i$ simultaneously and the new states thus obtained $\{S_i^{t+1}\}$ are
inserted in the right hand side in place of $\{S_i^t\}$ for the next update.¹⁶ Synchronous
dynamics is usually easier to treat than its asynchronous counterpart. Also it is
often the case that systems with these two types of dynamics share very similar
equilibrium properties (Amit et al. 1985; Fontanari and Köberle 1987).
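The synchronous update can be sketched in a few lines. The example below assumes Hebb couplings $J_{ij} = N^{-1}\sum_\mu \xi_i^\mu\xi_j^\mu$ with $J_{ii} = 0$, for which the local field can be written as $h_i = \sum_\mu \xi_i^\mu m_\mu - (p/N)S_i$. The system size, the number of patterns, the fraction of initially flipped spins, the convention $\mathrm{sgn}(0) = +1$, and all function names are illustrative assumptions.

```python
import random

def sgn(x):
    # Convention assumed here: sgn(0) = +1
    return 1 if x >= 0 else -1

def overlaps(patterns, state):
    """Overlap of the current state with every stored pattern."""
    n = len(state)
    return [sum(x * s for x, s in zip(xi, state)) / n for xi in patterns]

def synchronous_step(patterns, state):
    """Apply the sign update to all neurons at once (Hebb couplings, J_ii = 0)."""
    n, p = len(state), len(patterns)
    m = overlaps(patterns, state)
    # h_i = sum_j J_ij S_j = sum_mu xi_i^mu m_mu - (p/N) S_i; the last term removes j = i
    return [sgn(sum(patterns[mu][i] * m[mu] for mu in range(p)) - p / n * state[i])
            for i in range(n)]

def retrieve(n=200, p=5, flip_fraction=0.2, steps=5, seed=0):
    """Start near pattern 0 and record its overlap after each synchronous step."""
    rng = random.Random(seed)
    patterns = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(p)]
    state = [-x if rng.random() < flip_fraction else x for x in patterns[0]]
    history = [overlaps(patterns, state)[0]]
    for _ in range(steps):
        state = synchronous_step(patterns, state)
        history.append(overlaps(patterns, state)[0])
    return history

if __name__ == "__main__":
    print(retrieve())
```

Every entry of the new state is computed from the old state before any neuron is overwritten, which is what distinguishes this scheme from an asynchronous sweep.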


7.5.2 Time evolution of the overlap 


Let us investigate how the overlap $m_\mu$ changes with time under the synchronous
dynamics (7.59) when the synaptic couplings $J_{ij}$ are given by the Hebb rule (7.4)
with $J_{ii} = 0$. The goal is to express the time evolution of the system in terms of a
few macroscopic variables (order parameters). It is in general impossible to carry
out this programme with only a finite number of order parameters including the
overlap (Gardner et al. 1987). Some approximations must therefore be introduced. In the
Amari-Maginu dynamics, one derives dynamical equations in an approximate
but closed form for the overlap and the variance of noise in the input signal.

Let us consider the case where the initial condition is close to the first pattern
only, $m_1 = m > 0$ and $m_\mu = 0$ for $\mu \neq 1$. The input signal to $S_i$ is separated
into the true signal part (contributing to retrieval of the first pattern) and the
rest (noise):

$$ h_i^t = \sum_{j\neq i} J_{ij}S_j^t = \frac{1}{N}\sum_{\mu=1}^{p}\sum_{j\neq i}\xi_i^\mu\xi_j^\mu S_j^t = \frac{1}{N}\sum_{j\neq i}\xi_i^1\xi_j^1 S_j^t + \frac{1}{N}\sum_{\mu\neq 1}\sum_{j\neq i}\xi_i^\mu\xi_j^\mu S_j^t = \xi_i^1 m_t + N_i^t. \tag{7.60} $$

The term with $\mu = 1$ gives the true signal $\xi_i^1 m_t$, and the rest is the noise $N_i^t$.
Self-feedback $J_{ii}S_i^t$ vanishes and is omitted here, as are terms of $O(N^{-1})$.
The time evolution of the overlap is, from (7.59) and (7.60),

$$ m_{t+1} = \frac{1}{N}\sum_i \xi_i^1\,\mathrm{sgn}(h_i^t) = \frac{1}{N}\sum_i \mathrm{sgn}(m_t + \xi_i^1 N_i^t). \tag{7.61} $$

¹⁶ We implicitly had asynchronous dynamics in mind so far, in which (7.59) is applied to a
single $i$ and the new state of this $i$th neuron, $S_i^{t+1}$, is inserted in the right hand side with all
other neuron states $\{S_j\}$ $(j \neq i)$ unchanged. See Amit (1989) for detailed discussions on this
point.


If the noise term $\xi_i^1 N_i^t$ is negligibly small compared to $m_t$, then the time
development of $m_t$ is described by $m_{t+1} = \mathrm{sgn}(m_t)$, which immediately leads to
$m_1 = 1$ or $m_1 = -1$ at the next time step $t = 1$ depending on the sign of the
initial condition $m_{t=0}$. Noise cannot actually be ignored, and we have to consider
its effects as follows.

Since $N_i^t$ is composed of the sum of many terms involving stochastic variables
$\xi_j^\mu$, this $N_i^t$ would follow a Gaussian distribution according to the central limit
theorem if the terms in the sum were independent of each other. This independence
actually does not hold because $S_j^t$ in $N_i^t$ has been affected by all other $\{\xi_j^\mu\}$
during the past update of states. In the Amari-Maginu dynamics, one nevertheless
accepts the approximate treatment that $N_i^t$ follows a Gaussian distribution
with vanishing mean and variance $\sigma_t^2$, denoted by $N(0, \sigma_t^2)$. The appropriateness
of this approximation is judged by the results it leads to.

If we assume that $N_i^t$ obeys the distribution $N(0, \sigma_t^2)$, the same should hold
for $\xi_i^1 N_i^t$. Then the time development of $m_t$ can be derived in the large-$N$ limit
from (7.61) as

$$ m_{t+1} = \int Du\, \mathrm{sgn}(m_t + \sigma_t u) = F\left(\frac{m_t}{\sigma_t}\right), \tag{7.62} $$

where $F(x) = 2\int_0^x Du$.


7.5.3 Time evolution of the variance 


Our description of dynamics in terms of macroscopic variables $m_t$ and $\sigma_t$ becomes
a closed one if we know the time evolution of the variance of noise $\sigma_t^2$ since we
have already derived the equation for $m_t$ in (7.62). We may assume that the
first pattern to be retrieved has $\xi_i^1 = 1$ $(\forall i)$ without loss of generality, as can be
checked by the gauge transformation $S_i \to S_i\xi_i^1$, the ferromagnetic gauge.

With the notation $E[\cdots]$ for the average (expectation value), the variance of
$N_i^t$ is written as

$$ \sigma_t^2 = E[(N_i^t)^2] = \frac{1}{N^2}\sum_{\mu\neq 1}\sum_{\nu\neq 1}\sum_{j\neq i}\sum_{j'\neq i} E[\xi_i^\mu\xi_i^\nu\xi_j^\mu\xi_{j'}^\nu S_j^t S_{j'}^t]. \tag{7.63} $$


The expectation value in this sum is classified into four types according to the
combination of indices:
1. For $\mu = \nu$ and $j = j'$, $E[\cdots] = 1$. The number of such terms is $(p-1)(N-1)$.
2. For $\mu \neq \nu$ and $j = j'$, $E[\cdots] = 0$, which can be neglected.
3. Let us write $v_3 \equiv E[\xi_j^\mu\xi_{j'}^\mu S_j^t S_{j'}^t]$ for the contribution of the case $\mu = \nu$ and
$j \neq j'$. The number of terms is $(p-1)(N-1)(N-2)$.
4. When $\mu \neq \nu$ and $j \neq j'$, we have to evaluate $v_4 \equiv E[\xi_i^\mu\xi_i^\nu\xi_j^\mu\xi_{j'}^\nu S_j^t S_{j'}^t]$
explicitly. The number of terms is $(p-1)(p-2)(N-1)(N-2)$.

For evaluation of $v_3$, it is convenient to write the $\xi_j^\mu$, $\xi_{j'}^\mu$-dependence of $S_j^t S_{j'}^t$
explicitly. According to (7.60),


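$$ S_j^t = \mathrm{sgn}\left(m_{t-1} + Q + \xi_j^\mu R + N^{-1}\xi_j^\mu\xi_{j'}^\mu S_{j'}^{t-1}\right), \quad S_{j'}^t = \mathrm{sgn}\left(m_{t-1} + Q' + \xi_{j'}^\mu R + N^{-1}\xi_j^\mu\xi_{j'}^\mu S_j^{t-1}\right), \tag{7.64} $$

where $Q$, $Q'$, and $R$ are defined by

$$ Q = \frac{1}{N}\sum_{\nu\neq 1,\mu}\sum_{k\neq j}\xi_j^\nu\xi_k^\nu S_k^{t-1}, \quad Q' = \frac{1}{N}\sum_{\nu\neq 1,\mu}\sum_{k\neq j'}\xi_{j'}^\nu\xi_k^\nu S_k^{t-1}, \quad R = \frac{1}{N}\sum_{k\neq j,j'}\xi_k^\mu S_k^{t-1}. \tag{7.65} $$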


We assume these are independent Gaussian variables with vanishing mean and
variance $\sigma_{t-1}^2$, $\sigma_{t-1}^2$, and $\sigma_{t-1}^2/p$, respectively. By writing $Y_{11}$ for the contribution
to $v_3$ when $\xi_j^\mu = \xi_{j'}^\mu = 1$ and similarly for $Y_{1-1}$, $Y_{-11}$, $Y_{-1-1}$, we have

$$ v_3 = \frac{1}{4}\left(Y_{11} + Y_{1-1} + Y_{-11} + Y_{-1-1}\right). \tag{7.66} $$
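The sum of the first two terms is

$$ Y_{11} + Y_{1-1} = \int P(Q)P(Q')P(R)\,dQ\,dQ'\,dR\,\left\{ \mathrm{sgn}\left(m_{t-1} + Q + R + \frac{S_{j'}^{t-1}}{N}\right)\mathrm{sgn}\left(m_{t-1} + Q' + R + \frac{S_j^{t-1}}{N}\right) - \mathrm{sgn}\left(m_{t-1} + Q + R - \frac{S_{j'}^{t-1}}{N}\right)\mathrm{sgn}\left(m_{t-1} + Q' - R - \frac{S_j^{t-1}}{N}\right) \right\}. \tag{7.67} $$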


The Gaussian integral over $Q$ and $Q'$ gives the function $F$ used in (7.62). The
integral over $R$ can be performed by expanding $F$ to first order assuming that
$R \pm S_{j,j'}^{t-1}/N$ is much smaller than $m_{t-1}$ in the argument of $F$. The remaining
$Y_{-11} + Y_{-1-1}$ gives the same answer, and the final result is

$$ N v_3 = \frac{2}{\pi\alpha}\exp\left(-\frac{m_{t-1}^2}{\sigma_{t-1}^2}\right) + \frac{4 m_{t-1}}{\sqrt{2\pi}\,\sigma_{t-1}}\exp\left(-\frac{m_{t-1}^2}{2\sigma_{t-1}^2}\right)F\left(\frac{m_{t-1}}{\sigma_{t-1}}\right). \tag{7.68} $$


A similar calculation yields for $v_4$

$$ N^2 v_4 = \frac{2 m_{t-1}^2}{\pi\sigma_{t-1}^2}\exp\left(-\frac{m_{t-1}^2}{\sigma_{t-1}^2}\right). \tag{7.69} $$


We have thus obtained $\sigma_t^2$. To be more accurate in discussing the variance, it is
better to subtract the square of the average from the average of the square. The
relation $(E[N_i^t])^2 = p^2 v_4$ coming from the definition of $v_4$ can be conveniently
used in this calculation:

$$ \sigma_{t+1}^2 = \alpha + \frac{2}{\pi}\exp\left(-\frac{m_t^2}{\sigma_t^2}\right) + \frac{4\alpha m_t m_{t+1}}{\sqrt{2\pi}\,\sigma_t}\exp\left(-\frac{m_t^2}{2\sigma_t^2}\right). \tag{7.70} $$

Fig. 7.7. Memory retrieval by synchronous dynamics: (a) $\alpha = 0.08$, and (b) $\alpha = 0.20$

The two equations (7.62) and (7.70) determine the dynamical evolution of the
macroscopic parameters $m_t$ and $\sigma_t$.
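The pair of update equations for the overlap and the noise variance is straightforward to iterate. The sketch below assumes the initial variance $\sigma_0^2 = \alpha$, which is what the noise term gives when the initial state is uncorrelated with all patterns other than the first; the function name `amari_maginu` and the number of steps are illustrative. It uses $F(x) = 2\int_0^x Du = \mathrm{erf}(x/\sqrt{2})$.

```python
import math

def amari_maginu(alpha, m0, steps=300):
    """Iterate the overlap and noise-variance updates; returns [m_0, m_1, ...]."""
    m = m0
    var = alpha  # assumed initial noise variance sigma_0^2 = alpha
    trajectory = [m]
    for _ in range(steps):
        sigma = math.sqrt(var)
        # m_{t+1} = F(m_t / sigma_t) with F(x) = erf(x / sqrt(2))
        m_next = math.erf(m / (math.sqrt(2.0) * sigma))
        # sigma_{t+1}^2 = alpha + (2/pi) exp(-m_t^2 / sigma_t^2)
        #                + 4 alpha m_t m_{t+1} / (sqrt(2 pi) sigma_t) * exp(-m_t^2 / (2 sigma_t^2))
        var = (alpha
               + (2.0 / math.pi) * math.exp(-m * m / var)
               + 4.0 * alpha * m * m_next / (math.sqrt(2.0 * math.pi) * sigma)
               * math.exp(-m * m / (2.0 * var)))
        m = m_next
        trajectory.append(m)
    return trajectory

if __name__ == "__main__":
    for alpha in (0.08, 0.20):
        final = amari_maginu(alpha, m0=1.0)[-1]
        print(f"alpha = {alpha:.2f}: overlap after 300 steps = {final:.4f}")
```

Starting exactly at the pattern, the overlap stays close to unity for $\alpha = 0.08$ and decays to zero for $\alpha = 0.20$. Scanning `m0` at fixed `alpha` gives a rough estimate of the threshold initial overlap.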


7.5.4 Limit of applicability 

We show some explicit solutions of the time evolution equations (7.62) and (7.70).
Figure 7.7 depicts $m_t$ for various initial conditions with $\alpha = 0.08$ (a) and 0.20 (b).
When $\alpha = 0.20$, the state tends to $m_t \to 0$ as $t \to \infty$ for any initial conditions, a
retrieval failure. For $\alpha = 0.08$, $m_t$ approaches 1, a successful retrieval, when the
initial condition is larger than a threshold, $m_0 > m_{0c}$. These analytical results
have been confirmed to agree with simulations, at least qualitatively.

Detailed simulations, however, show that the assumption of Gaussian noise 
is close to reality when retrieval succeeds, but noise does not obey a Gaussian 
distribution when retrieval fails (Nishimori and Ozeki 1993). The system moves 
towards a spin glass state when retrieval fails, in which case the phase space has 
a very complicated structure and the dynamics cannot be described in terms of
only two variables $m_t$ and $\sigma_t$.

For $\alpha$ larger than a critical value $\alpha_c$, the system moves away from the embedded
pattern even if the initial condition is exactly at the pattern, $m_0 = 1$. Numerical
evaluation of (7.62) and (7.70) reveals that this critical value is $\alpha_c = 0.16$.
This quantity should in principle be identical to the critical value $\alpha_c = 0.138$ for
the existence of the retrieval phase at $T = 0$ in the Hopfield model (see Fig. 7.5).
Note that equilibrium replica analysis has been performed also for the system
with synchronous dynamics under the Hebb rule (Fontanari and Köberle 1987).
The RS result $\alpha_c = 0.138$ is shared by this synchronous case. It is also known
that the RSB changes $\alpha_c$ only by a very small amount (Steffan and Kühn 1994).

Considering several previous steps in deriving the time evolution equation is
known to improve results (Okada 1995), but this method is still inexact as it uses
a Gaussian assumption of noise distribution. It is in general impossible to solve
exactly and explicitly the dynamics of associative memory with an extensive
number of patterns embedded ($p$ proportional to $N$). Closed-form approximation
based on a dynamical version of equipartitioning gives a reasonably good
description of the dynamical evolution of macrovariables (Coolen and Sherrington
1994). Another approach is to use a dynamical generating functional to derive
various relations involving correlation and response functions (Rieger et al. 1989;
Horner et al. 1989), which gives some insight into the dynamics, in particular in

Fig. 7.8. Simple perceptron


7.6 Perceptron and volume of connections 


It is important to investigate the limit of performance of a single neuron. In 
the present section, we discuss the properties of a simple perceptron with the 
simplest possible activation function. The point of view employed here is a lit- 
tle different from the previous arguments on associative memory because the 
synaptic couplings are dynamical variables here. 


7.6.1 Simple perceptron 

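A simple perceptron is an element that gives an output $\sigma^\mu$ according to the following
rule: given $N$ inputs $\xi_1^\mu, \xi_2^\mu, \ldots, \xi_N^\mu$ and synaptic connections $J_1, J_2, \ldots, J_N$
(Fig. 7.8), then

$$ \sigma^\mu = \mathrm{sgn}\Bigl(\sum_j J_j\xi_j^\mu - \theta\Bigr). \tag{7.71} $$

Here $\theta$ is a threshold and $\mu$ $(= 1, 2, \ldots, p)$ is the index of the input-output
pattern to be realized by the perceptron. The perceptron is required to adjust
the weights $J = \{J_j\}$ so that the input $\{\xi_j^\mu\}_{j=1,\ldots,N}$ leads to the desired output
$\sigma^\mu$. We assume $\xi_j^\mu, \sigma^\mu = \pm 1$; then the task of the perceptron is to classify $p$ input
patterns into two classes, those with $\sigma^\mu = 1$ and those with $\sigma^\mu = -1$.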

Let us examine an example of the case $N = 2$. By choosing $J_1$, $J_2$, and $\theta$
appropriately as shown in Fig. 7.9, we have $\sum_j J_j\xi_j^\mu - \theta = 0$ on the dashed line
perpendicular to the vector $J = {}^t(J_1, J_2)$. This means that the output is $\sigma^1 = 1$
for the input $\xi_1^1 = \xi_2^1 = 1$ denoted by the full circle and $\sigma^\mu = -1$ for the open
circles, thus separating the full and open circles by the dashed line. However,
if there were two full circles at $(1, 1)$ and $(-1, -1)$ and the rest were open,
no straight line would separate the circles. It should then be clear that a simple
perceptron is capable of realizing classification tasks corresponding to a bisection
of the $\{\xi_j^\mu\}$ space by a hypersurface (a line in the case of two dimensions). This
condition is called linear separability.

Fig. 7.9. Linearly separable task. The arrow is the vector $J = {}^t(J_1, J_2)$.


7.6.2  Perceptron learning 


Although it has the constraint of linear separability, a simple perceptron plays 
very important roles in the theory of learning to be discussed in the next chapter. 
Learning means to change the couplings J gradually, reflecting given examples of 
input and output. The goal is to adjust the properties of the elements (neurons) 
or the network so that the correct input-output pairs are realized appropriately. 
We shall assume $\theta = 0$ in (7.71) for simplicity.

If the initial couplings of a perceptron are given randomly, the correct input- 
output pairs are not realized. However, under the following learning rule (the rule 
to change the couplings), it is known that the perceptron eventually reproduces 
correct input-output pairs as long as the examples are all linearly separable and 
the number of couplings N is finite (the convergence theorem): 


$$ J_j(t + \Delta t) = J_j(t) + \begin{cases} 0 & \text{(correct output for the $\mu$th pattern)} \\ \eta\,\sigma^\mu\xi_j^\mu & \text{(otherwise).} \end{cases} \tag{7.72} $$


This is called the perceptron learning rule. Here $\eta$ is a small positive constant. We
do not present a formal proof of the theorem here. It is nevertheless to be noted
that, by the rule (7.72), the sum of input signals $\sum_j J_j\xi_j^\mu$ increases by $\eta\sigma^\mu N$
when the previous output was incorrect, whereby the output $\mathrm{sgn}(\sum_j J_j\xi_j^\mu)$ is
pushed towards the correct value $\sigma^\mu$.
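The learning rule can be tried on a small linearly separable task. The following is a minimal sketch with $\theta = 0$. The task is generated by a hypothetical reference perceptron with $\pm 1$ couplings and an odd number of inputs, chosen only so that every input has a nonzero margin; the function names, sizes, and value of $\eta$ are illustrative assumptions. An output is treated as incorrect when $\sigma^\mu\sum_j J_j\xi_j^\mu \le 0$.

```python
import random

def local_field(J, xi):
    return sum(Jj * x for Jj, x in zip(J, xi))

def train_perceptron(inputs, targets, eta=0.1, max_sweeps=10000):
    """Cycle through the patterns, changing J only after an incorrect output."""
    J = [0.0] * len(inputs[0])
    for sweep in range(max_sweeps):
        mistakes = 0
        for xi, sigma in zip(inputs, targets):
            if sigma * local_field(J, xi) <= 0:          # incorrect (or undecided) output
                J = [Jj + eta * sigma * x for Jj, x in zip(J, xi)]
                mistakes += 1
        if mistakes == 0:
            return J, sweep                               # every pattern reproduced
    return J, max_sweeps

def make_separable_task(n=11, p=20, seed=1):
    """Random +-1 inputs labelled by a hypothetical reference perceptron (odd n avoids ties)."""
    rng = random.Random(seed)
    reference = [rng.choice((-1, 1)) for _ in range(n)]
    inputs = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(p)]
    targets = [1 if local_field(reference, xi) > 0 else -1 for xi in inputs]
    return inputs, targets

if __name__ == "__main__":
    inputs, targets = make_separable_task()
    J, sweeps = train_perceptron(inputs, targets)
    print("sweeps needed:", sweeps)
    print("all correct:", all(s * local_field(J, xi) > 0 for xi, s in zip(inputs, targets)))
```

Because the examples are linearly separable by construction, the loop stops after finitely many corrections, in line with the convergence theorem quoted above.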
7.6.3 Capacity of a perceptron


An interesting question is how many input-output pairs can be realized by a
simple perceptron. The number of such pairs will be denoted
by $p$. Since we are interested in the limit of large $N$, the following normalization
of the input signal will turn out to be appropriate:

$$ \sigma^\mu = \mathrm{sgn}\Bigl(\frac{1}{\sqrt{N}}\sum_j J_j\xi_j^\mu\Bigr). \tag{7.73} $$

The couplings are assumed to be normalized, $\sum_j J_j^2 = N$.

As we present examples (patterns or tasks to be realized) $\mu = 1, 2, 3, \ldots$ to the
simple perceptron, the region in the $J$-space where these examples are correctly
reproduced by (7.73) shrinks gradually. Beyond a certain limit of the number of
examples, the volume of this region vanishes and no $J$ will reproduce some of
the tasks. Such a limit is termed the capacity of the perceptron and is known to
be $2N$. We derive a generalization of this result using techniques imported from
spin glass theory (Gardner 1987, 1988).

The condition that the $\mu$th example is correctly reproduced is, as is seen by
multiplying both sides of (7.73) by $\sigma^\mu$,

$$ \Delta^\mu \equiv \frac{\sigma^\mu}{\sqrt{N}}\sum_j J_j\xi_j^\mu > 0. \tag{7.74} $$

We generalize this inequality by replacing the right hand side by a positive
constant $\kappa$

$$ \Delta^\mu > \kappa \tag{7.75} $$

and calculate the volume of the subspace in the $J$-space satisfying this relation.
If we use (7.74), the left hand side can be very close to zero, and, when the input
deviates slightly from the correct $\{\xi_j^\mu\}$, $\Delta^\mu$ might become negative, producing
an incorrect output. In (7.75) on the other hand, a small amount of error (or
noise) in the input would not affect the sign of $\Delta^\mu$ so that the output remains
intact, implying a larger capability of error correction (or noise removal).

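Let us consider the following volume (Gardner volume) in the $J$-space satisfying
(7.75) with the normalization condition taken into account:

$$ V = \frac{1}{V_0}\int\prod_j dJ_j\,\delta\Bigl(\sum_j J_j^2 - N\Bigr)\prod_\mu\Theta(\Delta^\mu - \kappa), \quad V_0 = \int\prod_j dJ_j\,\delta\Bigl(\sum_j J_j^2 - N\Bigr), \tag{7.76} $$

where $\Theta(x)$ is a step function: $\Theta(x) = 1$ for $x > 0$ and 0 for $x < 0$.

Since we are interested in the typical behaviour of the system for random
examples of input-output pairs, it is necessary to take the configurational average
of the extensive quantity $\log V$ over the randomness of $\xi_j^\mu$ and $\sigma^\mu$, similar to the
spin glass theory.¹⁷ For this purpose we take the average of $V^n$ and let $n$ tend
to zero, following the prescription of the replica method,

$$ [V^n] = \frac{1}{V_0^n}\left[\int\prod_{j,\alpha}dJ_j^\alpha\prod_\alpha\delta\Bigl(\sum_j (J_j^\alpha)^2 - N\Bigr)\prod_{\alpha,\mu}\Theta\Bigl(\frac{\sigma^\mu}{\sqrt{N}}\sum_j J_j^\alpha\xi_j^\mu - \kappa\Bigr)\right]. \tag{7.77} $$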
7.6.4 Replica representation 


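To proceed further with the calculation of (7.77), we use the integral representation
of the step function

$$ \Theta(y - \kappa) = \int_\kappa^\infty \frac{d\lambda}{2\pi}\int_{-\infty}^{\infty} dx\, e^{ix(\lambda - y)} \tag{7.78} $$

and carry out the average $[\cdots]$ over $\{\xi_j^\mu\}$:

$$ \Bigl[\prod_{\alpha,\mu}\Theta\Bigl(\frac{\sigma^\mu}{\sqrt{N}}\sum_j J_j^\alpha\xi_j^\mu - \kappa\Bigr)\Bigr] = \int_\kappa^\infty\prod_{\alpha,\mu}d\lambda_\mu^\alpha\int\prod_{\alpha,\mu}dx_\mu^\alpha\,\exp\Bigl\{i\sum_{\alpha,\mu}x_\mu^\alpha\lambda_\mu^\alpha + \sum_{j,\mu}\log\cos\Bigl(\frac{1}{\sqrt{N}}\sum_\alpha x_\mu^\alpha J_j^\alpha\Bigr)\Bigr\} \approx \Bigl\{\int_\kappa^\infty\prod_\alpha d\lambda^\alpha\int\prod_\alpha dx^\alpha\,\exp\Bigl(i\sum_\alpha x^\alpha\lambda^\alpha - \frac{1}{2}\sum_\alpha (x^\alpha)^2 - \sum_{(\alpha\beta)}q_{\alpha\beta}x^\alpha x^\beta\Bigr)\Bigr\}^p. \tag{7.79} $$

Here we have dropped the trivial factor (a power of $2\pi$), set $q_{\alpha\beta} = \sum_j J_j^\alpha J_j^\beta/N$,
and used the approximation $\log\cos(x) \approx -x^2/2$ valid for small $x$. The notation
$(\alpha\beta)$ is for the $n(n-1)/2$ different replica pairs. Apparently, $q_{\alpha\beta}$ is interpreted
as the spin glass order parameter in the space of couplings.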

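Using the normalization condition $\sum_j (J_j^\alpha)^2 = N$ and the Fourier representation
of the delta function to express the definition of $q_{\alpha\beta}$, we find

$$ [V^n] = V_0^{-n}\int\prod_{(\alpha\beta)}dq_{\alpha\beta}\,dF_{\alpha\beta}\prod_\alpha dE_\alpha\int\prod_{j,\alpha}dJ_j^\alpha\,\{(7.79)\}\,\exp\Bigl\{-iN\sum_\alpha E_\alpha - iN\sum_{(\alpha\beta)}F_{\alpha\beta}q_{\alpha\beta} + i\sum_\alpha E_\alpha\sum_j (J_j^\alpha)^2 + i\sum_{(\alpha\beta)}F_{\alpha\beta}\sum_j J_j^\alpha J_j^\beta\Bigr\} = V_0^{-n}\int\prod_{(\alpha\beta)}dq_{\alpha\beta}\,dF_{\alpha\beta}\prod_\alpha dE_\alpha\, e^{NG}. \tag{7.80} $$

Here

$$ G = \alpha G_1(q_{\alpha\beta}) + G_2(F_{\alpha\beta}, E_\alpha) - i\sum_\alpha E_\alpha - i\sum_{(\alpha\beta)}F_{\alpha\beta}q_{\alpha\beta} \tag{7.81} $$

with

$$ G_1(q_{\alpha\beta}) = \log\int_\kappa^\infty\prod_\alpha d\lambda^\alpha\int\prod_\alpha dx^\alpha\,\exp\Bigl(i\sum_\alpha x^\alpha\lambda^\alpha - \frac{1}{2}\sum_\alpha (x^\alpha)^2 - \sum_{(\alpha\beta)}q_{\alpha\beta}x^\alpha x^\beta\Bigr), \tag{7.82} $$

$$ G_2(F_{\alpha\beta}, E_\alpha) = \log\int\prod_\alpha dJ^\alpha\,\exp\Bigl\{i\Bigl(\sum_\alpha E_\alpha (J^\alpha)^2 + \sum_{(\alpha\beta)}F_{\alpha\beta}J^\alpha J^\beta\Bigr)\Bigr\}. \tag{7.83} $$

The reader should not confuse the factor $\alpha$ $(= p/N)$ in front of $G_1(q_{\alpha\beta})$ in (7.81)
with the replica index.

¹⁷ In the present problem of capacity, $\xi_j^\mu$ and $\sigma^\mu$ are independent random numbers. In the
theory of learning to be discussed in the next chapter, $\sigma^\mu$ is a function of $\{\xi_j^\mu\}$.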


7.6.5 Replica-symmetric solution 


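The assumption of replica symmetry $q_{\alpha\beta} = q$, $F_{\alpha\beta} = F$, $E_\alpha = E$ makes it possible
to go further with the evaluation of (7.82) and (7.83). We write the integral in
(7.82) as $I_1$, which is simplified under replica symmetry as

$$ I_1 = \int_\kappa^\infty\prod_\alpha d\lambda^\alpha\int\prod_\alpha dx^\alpha\,\exp\Bigl(i\sum_\alpha x^\alpha\lambda^\alpha - \frac{1-q}{2}\sum_\alpha (x^\alpha)^2 - \frac{q}{2}\Bigl(\sum_\alpha x^\alpha\Bigr)^2\Bigr) $$
$$ = \int Dy\int_\kappa^\infty\prod_\alpha d\lambda^\alpha\int\prod_\alpha dx^\alpha\,\exp\Bigl(i\sum_\alpha x^\alpha\lambda^\alpha - \frac{1-q}{2}\sum_\alpha (x^\alpha)^2 + iy\sqrt{q}\sum_\alpha x^\alpha\Bigr) $$
$$ = \int Dy\,\Bigl\{\int_\kappa^\infty d\lambda\int dx\,\exp\Bigl(-\frac{1-q}{2}x^2 + ix(\lambda + y\sqrt{q})\Bigr)\Bigr\}^n. \tag{7.84} $$

The quantity in the final braces $\{\cdots\}$, to be denoted as $L(q)$, is, after integration
over $x$,

$$ L(q) = \int_\kappa^\infty d\lambda\,\frac{\sqrt{2\pi}}{\sqrt{1-q}}\exp\Bigl(-\frac{(\lambda + y\sqrt{q})^2}{2(1-q)}\Bigr) = 2\sqrt{\pi}\,\mathrm{Erfc}\Bigl(\frac{\kappa + y\sqrt{q}}{\sqrt{2(1-q)}}\Bigr), \tag{7.85} $$

where $\mathrm{Erfc}(x)$ is the complementary error function $\int_x^\infty e^{-t^2}dt$. Thus $G_1(q_{\alpha\beta})$ in
(7.82) is, in the limit $n \to 0$,

$$ G_1(q) = n\int Dy\,\log L(q). \tag{7.86} $$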


The integral over $J$ in $G_2(F, E)$, on the other hand, can be evaluated by
a multidimensional Gaussian integral using the fact that the exponent is a
quadratic form in $J$.¹⁸ The eigenvalues of the quadratic form of $J$, $E\sum_\alpha (J^\alpha)^2 +
F\sum_{(\alpha\beta)}J^\alpha J^\beta$, are $E + (n-1)F/2$ (without degeneracy) and $E - F/2$ (degeneracy
$(n-1)$), from which $G_2$ is, apart from a trivial constant,

$$ G_2(F, E) = -\frac{1}{2}\log\Bigl(E + \frac{n-1}{2}F\Bigr) - \frac{n-1}{2}\log\Bigl(E - \frac{F}{2}\Bigr) \to -\frac{n}{2}\log\Bigl(E - \frac{F}{2}\Bigr) - \frac{nF}{4E - 2F}. \tag{7.87} $$

Substitution of (7.86) and (7.87) into (7.81) gives

$$ \frac{1}{n}G = \alpha\int Dy\,\log L(q) - \frac{1}{2}\log\Bigl(E - \frac{F}{2}\Bigr) - \frac{F}{4E - 2F} - iE + \frac{i}{2}Fq. \tag{7.88} $$

According to the method of steepest descent, $E$ and $F$ can be eliminated in the
limit $N \to \infty$ by extremization of $G$ with respect to $E$ and $F$:

$$ E = \frac{i(1 - 2q)}{2(1 - q)^2}, \quad F = -\frac{iq}{(1 - q)^2}. \tag{7.89} $$

Then, (7.88) can be written only in terms of $q$, and the following expression
results:

$$ \frac{1}{n}G = \alpha\int Dy\,\log L(q) + \frac{1}{2}\log(1 - q) + \frac{1}{2(1 - q)}. \tag{7.90} $$


Various $J$ are allowed for small $p$, but as $p$ approaches the capacity limit,
the freedom of choice is narrowed in the $J$-space, and eventually only a single
one survives. Then $q$ becomes unity by the definition $N^{-1}\sum_j J_j^\alpha J_j^\beta$, so that
the capacity of the perceptron, $\alpha_c$, is obtained by extremization of (7.90) with
respect to $q$ and setting $q \to 1$. Using the limiting form of the complementary
error function $\mathrm{Erfc}(x) \approx e^{-x^2}/2x$ as $x \to \infty$,

$$ \frac{1}{2(1 - q)^2} = \frac{\alpha}{2(1 - q)^2}\int_{-\kappa}^\infty Dy\,(\kappa + y)^2. \tag{7.91} $$

The final result for the capacity is

$$ \alpha_c(\kappa) = \Bigl\{\int_{-\kappa}^\infty Dy\,(\kappa + y)^2\Bigr\}^{-1}. \tag{7.92} $$

¹⁸ Another idea is to reduce the quadratic form in the exponent involving $F(\sum_\alpha J^\alpha)^2$ to a
linear form by Gaussian integration.

Fig. 7.10. Capacity of a simple perceptron

In the limit $\kappa \to 0$, $\alpha_c(0) = 2$, which is the capacity $2N$ referred to in the first
part of §7.6.3. The function $\alpha_c(\kappa)$ is monotonically decreasing as shown in Fig.
7.10. We mention without proof here that the RS solution is known to be stable
as long as $\alpha < 2$ (Gardner 1987, 1988), which means that the $J$-space does not
have a complicated structure below the capacity.
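The capacity $\alpha_c(\kappa) = \{\int_{-\kappa}^\infty Dy\,(\kappa + y)^2\}^{-1}$ is easy to evaluate. The sketch below uses the closed form of the Gaussian integral, $\int_{-\kappa}^\infty Dy\,(\kappa + y)^2 = (1 + \kappa^2)H(-\kappa) + \kappa\,e^{-\kappa^2/2}/\sqrt{2\pi}$ with $H(x) = \int_x^\infty Dy$, obtained by integration by parts, and compares it with a plain midpoint rule. The function names and the quadrature parameters are illustrative assumptions. Note that Python's `math.erfc` includes the factor $2/\sqrt{\pi}$, unlike the $\mathrm{Erfc}$ defined in the text.

```python
import math

def gaussian_tail(x):
    """H(x) = int_x^infinity Dy for the standard Gaussian measure Dy."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def capacity(kappa):
    """alpha_c(kappa) from the closed form of the Gaussian integral."""
    phi = math.exp(-0.5 * kappa * kappa) / math.sqrt(2.0 * math.pi)
    integral = (1.0 + kappa * kappa) * gaussian_tail(-kappa) + kappa * phi
    return 1.0 / integral

def capacity_numeric(kappa, upper=12.0, steps=20000):
    """Same quantity by a midpoint rule on [-kappa, upper], as an independent check."""
    width = (upper + kappa) / steps
    total = 0.0
    for k in range(steps):
        y = -kappa + (k + 0.5) * width
        total += (y + kappa) ** 2 * math.exp(-0.5 * y * y)
    return math.sqrt(2.0 * math.pi) / (total * width)

if __name__ == "__main__":
    for kappa in (0.0, 0.5, 1.0, 2.0):
        print(f"kappa = {kappa:.1f}: alpha_c = {capacity(kappa):.4f} "
              f"(numeric {capacity_numeric(kappa):.4f})")
```

At $\kappa = 0$ the integral equals $1/2$, which reproduces $\alpha_c(0) = 2$, and the value decreases as $\kappa$ grows.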


Bibliographical note 


Most of the developments in neural network theory from statistical-mechanical 
points of view are covered in Amit (1989), Hertz et al. (1991), and Domany 
et al. (1991) as far as references until around 1990 are concerned. More up-to- 
date expositions are found in Coolen (2001) and Coolen and Sherrington (2001). 
Activities in recent years are centred mainly around the dynamical aspects of 
memory retrieval and the problem of learning. The latter will be treated in the 
next chapter. Reviews concerning the former topic are given in Coolen (2001), 
Coolen and Sherrington (2001), and Domany et al. (1995). 


8

LEARNING IN PERCEPTRON


In the previous chapter we calculated the capacity of a simple perceptron under 
random combinations of input and output. The problem of learning is different 
from the capacity problem in that the perceptron is required to simulate the 
functioning of another perceptron even for new inputs, not to reproduce random 
signals as in the previous chapter. For this purpose, the couplings are gradually 
adjusted so that the probability of correct output increases. An important objec- 
tive of the theory of learning is to estimate the functional relation between the 
number of examples and the expected error under a given algorithm to change 
couplings. The argument in this book will be restricted to learning in simple 
perceptrons. 


8.1 Learning and generalization error 


We first explain a few basic notions of learning. In particular, we introduce the 
generalization error, which is the expected error rate for a new input not included 
in the given examples used to train the perceptron. 


8.1.1 Learning in perceptron 


Let us prepare two perceptrons, one of which is called a teacher and the other a
student. The two perceptrons share a common input $\xi = \{\xi_j\}$ but the outputs
are different because of differences in the couplings. The set of couplings of the
teacher will be denoted by $B = \{B_j\}$ and that of the student by $J = \{J_j\}$ (Fig.
8.1). All these vectors are of dimensionality $N$. The student compares its output
with that of the teacher and, if necessary, modifies its own couplings so that 
the output tends to coincide with the teacher output. The teacher couplings do 
not change. This procedure is termed supervised learning. The student couplings 
J change according to the output of the teacher; the student tries to simulate 
the teacher, given only the teacher output without explicit knowledge of the 
teacher couplings. In the case of unsupervised learning, by contrast, a perceptron 
changes its couplings only according to the input signals. There is no teacher 
which gives an ideal output. The perceptron adjusts itself so that the structure 
of the input signal (e.g. the distribution function of inputs) is well represented 
in the couplings. We focus our attention on supervised learning. 

Supervised learning is classified into two types. In batch learning (or off-line
learning), the student learns a given set of examples repeatedly until the outputs for
these examples by the teacher are reproduced sufficiently well. Then the same
process is repeated for additional examples. The student learns to reproduce the
examples faithfully. A disadvantage is that the time and memory required are
large.

Fig. 8.1. Teacher and student perceptrons with $N = 3$

The protocol of on-line learning is to change the student couplings immedi- 
ately for a given example that is not necessarily used repeatedly. The student 
may not be able to answer correctly to previously given examples in contrast 
to batch learning. One can instead save time and memory in on-line learning. 
Another advantage is that the student can follow changes of the environment 
(such as the change of the teacher structure, if any) relatively quickly. 

In general, the teacher and student may have more complex structures than a 
simple perceptron. For example, multilayer networks (where a number of simple 
elements are connected to form successive layers) are frequently used in prac- 
tical applications because of their higher capabilities of information processing 
than simple perceptrons. Nevertheless, we mainly discuss the case of a simple 
perceptron to elucidate the basic concepts of learning, from which analyses of 
more complex systems start. References for further study are listed at the end 
of the chapter. 


8.1.2 Generalization error 


One of the most important quantities in the theory of learning is the generaliza- 
tion error, which is, roughly speaking, the expectation value of the error in the 
student output for a new input. An important goal of the theory of learning is 
to clarify the behaviour of generalization error as a function of the number of 
examples. 

To define the generalization error more precisely, we denote the input signals
to the student and teacher by $u$ and $v$, respectively,

$$ u = \sum_{j=1}^N J_j\xi_j, \quad v = \sum_{j=1}^N B_j\xi_j. \tag{8.1} $$

Suppose that both student and teacher are simple perceptrons. Then the outputs
are $\mathrm{sgn}(u)$ and $\mathrm{sgn}(v)$, respectively. It is also assumed that the components of
student and teacher couplings take arbitrary values under normalization $\sum_j J_j^2 =
\sum_j B_j^2 = N$. We discuss the case where the components of the input vector are
independent stochastic variables satisfying $[\xi_i] = 0$ and $[\xi_i\xi_j] = \delta_{ij}/N$. Other
detailed properties of $\xi$ (such as whether $\xi_i$ is discrete or continuous) will be
irrelevant to the following argument in the limit of large $N$. The overlap between
the student and teacher couplings is written as $R$:

$$ R = \frac{1}{N}\sum_{j=1}^N B_j J_j. \tag{8.2} $$

As the process of learning proceeds, the structure of the student usually
approaches that of the teacher ($J \approx B$) and consequently $R$ becomes closer to
unity.

We consider the limit of large N since the description of the system then often 
simplifies significantly. For example, the stochastic variables u and v follow the 
Gaussian distribution of vanishing mean, variance unity, and covariance (average 
of uv) R when N — oo according to the central limit theorem, 


$$ P(u, v) = \frac{1}{2\pi\sqrt{1 - R^2}}\exp\Bigl(-\frac{u^2 + v^2 - 2Ruv}{2(1 - R^2)}\Bigr). \tag{8.3} $$

The average with respect to this probability distribution function corresponds
to the average over the input distribution.

To define the generalization error, we introduce the training energy or training
cost for $p$ examples:

$$ E = \sum_{\mu=1}^p V(J, \sigma_\mu, \xi_\mu), \tag{8.4} $$
where $\sigma_\mu = \mathrm{sgn}(v_\mu)$ is the teacher output to the $\mu$th input vector $\xi_\mu$. Note
that the training energy depends on the teacher coupling $B$ only through the
teacher output $\sigma_\mu$; the student does not know the detailed structure of the teacher
($B$) but is given only the output $\sigma_\mu$. The generalization function $E(J, \sigma)$ is
the average of $V(J, \sigma, \xi)$ over the distribution of input $\xi$. A common choice of
training energy for the simple perceptron is the number of incorrect outputs

$$ E = \sum_{\mu=1}^p\Theta(-\sigma_\mu u_\mu) \quad \Bigl(= \sum_{\mu=1}^p\Theta(-u_\mu v_\mu)\Bigr), \tag{8.5} $$

where $\Theta(x)$ is the step function. The generalization function corresponding to
the energy (8.5) is the probability that $\mathrm{sgn}(u)$ is different from $\mathrm{sgn}(v)$, which is
evaluated by integrating $P(u, v)$ over the region of $uv < 0$:

$$ E(R) \equiv \int du\,dv\,P(u, v)\,\Theta(-uv) = \frac{1}{\pi}\cos^{-1}R. \tag{8.6} $$

Fig. 8.2. Coupling vectors of the teacher and student, and the input vectors to
give incorrect student output (shaded).


The generalization error $\epsilon_g$ of a simple perceptron is the average of the
generalization function over the distributions of teacher and student structures. The
generalization function $E(R)$ depends on the teacher and student structures only
through the overlap $R$. Since the overlap $R$ is a self-averaging quantity for a large
system, the generalization function is effectively equal to the generalization error
in a simple perceptron. The overlap $R$ will be calculated later as a function of the
number of examples divided by the system size, $\alpha = p/N$. Thus the generalization
error is represented as a function of $\alpha$ through the generalization function:

$$ \epsilon_g(\alpha) = E(R(\alpha)). \tag{8.7} $$

The generalization error as a function of the number of examples, $\epsilon_g(\alpha)$, is called
the learning curve.

It is possible to understand (8.6) intuitively as follows. The teacher gives
output 1 for input vectors lying above the plane perpendicular to the coupling
vector $B$, $S_B$: $\sum_j B_j\xi_j = 0$ (i.e. the output is 1 for $\sum_j B_j\xi_j > 0$); the teacher
output is $-1$ if the input is below the plane, $\sum_j B_j\xi_j < 0$. Analogously, the student
determines its output according to the position of the input vector relative
to the surface $S_J$ perpendicular to the coupling vector $J$. Hence the probability
of an incorrect output from the student (generalization error) is the ratio of the
subspace bounded by $S_B$ and $S_J$ to the whole space (Fig. 8.2). The generalization
error is therefore equal to $2\theta/2\pi$, where $\theta$ is the angle between $B$ and $J$. In
terms of the inner product $R$ between $B$ and $J$, $\theta/\pi$ is expressed as (8.6).
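The relation $E(R) = \pi^{-1}\cos^{-1}R$ can be checked by direct sampling. The sketch below draws pairs $(u, v)$ with unit variances and covariance $R$, as in (8.3), and counts how often their signs differ. The function names, the sample size, and the seed are illustrative assumptions.

```python
import math
import random

def generalization_function(R):
    """E(R) = arccos(R) / pi."""
    return math.acos(R) / math.pi

def disagreement_rate(R, samples=100000, seed=0):
    """Monte Carlo estimate of the probability that sgn(u) differs from sgn(v)."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(samples):
        v = rng.gauss(0.0, 1.0)
        # u has unit variance and covariance R with v
        u = R * v + math.sqrt(1.0 - R * R) * rng.gauss(0.0, 1.0)
        if u * v < 0:
            wrong += 1
    return wrong / samples

if __name__ == "__main__":
    for R in (0.0, 0.5, 0.9):
        print(f"R = {R:.1f}: formula {generalization_function(R):.4f}, "
              f"sampled {disagreement_rate(R):.4f}")
```

The two columns agree up to sampling error, and the error vanishes as $R \to 1$, when student and teacher coincide.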


8.2 Batch learning 

Statistical-mechanical formulation is a very powerful tool to analyse batch learning.
A complementary approach, PAC (probably approximately correct) learning, gives
very general bounds on learning curves (Abu-Mostafa 1989; Hertz et al. 1991),
while statistical mechanics makes it possible to calculate the learning curve
explicitly for a specified learning algorithm.


8.2.1 Bayesian formulation 


It is instructive to formulate the theory of learning as a problem of statistical
inference using the Bayes formula because a statistical-mechanical point of view is
then naturally introduced (Opper and Haussler 1991; Opper and Kinzel 1995).
We would like to infer the correct couplings of the teacher $B$, given $p$ examples
of teacher output $\sigma_\mu$ $(= \mathrm{sgn}(v_\mu))$ for input $\xi_\mu$ $(\mu = 1, \ldots, p)$. The result of
inference is the student couplings $J$, which can be derived using the posterior
$P(J|\sigma_1, \ldots, \sigma_p)$. For this purpose, we introduce a prior $P(J)$, and the conditional
probability to produce the outputs $\sigma_1, \ldots, \sigma_p$ will be written as $P(\sigma_1, \ldots, \sigma_p|J)$.
Then the Bayes formula gives

$$ P(J|\sigma_1, \ldots, \sigma_p) = \frac{P(\sigma_1, \ldots, \sigma_p|J)\,P(J)}{Z}, \tag{8.8} $$
where Z is the normalization. Although it is possible to proceed without ex- 
plicitly specifying the form of the conditional probability P(o1,...,@p|J), it is 


instructive to illustrate the idea for the case of output noise defined as 


_ f sgn(v) prob. (1+e7%)71 EA 
ne l —sgn(v) prob. (1+ ef). on) 


The zero-temperature limit (8 = 1/T — oo) has no noise in the output (o = 
sgn(v)) whereas the high-temperature extreme (8 — 0) yields perfectly random 
output. The conditional probability of output op, given the teacher connections 
B for a fixed single input €,,, is thus 


Olov O(-o,,v exp {—B0(—a,,v 
P(o,,|B) Z ( L u) + ( i u) n p{ 5 = u)} (8.10) 
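
The equality of the two forms of the single-example likelihood can be checked numerically. The sketch below is only an illustration: it assumes the output-noise model stated above (correct sign with probability $(1+e^{-\beta})^{-1}$), and the function names are hypothetical.

```python
import math

def theta(x):
    """Step function: 1 for x > 0, otherwise 0 (the point x = 0 is not used here)."""
    return 1.0 if x > 0 else 0.0

def likelihood_two_terms(sigma, v, beta):
    """Sum of the 'correct' and 'flipped' contributions of the output-noise model."""
    return (theta(sigma * v) / (1.0 + math.exp(-beta))
            + theta(-sigma * v) / (1.0 + math.exp(beta)))

def likelihood_boltzmann(sigma, v, beta):
    """Same quantity written as a Boltzmann factor that counts one error."""
    return math.exp(-beta * theta(-sigma * v)) / (1.0 + math.exp(-beta))

for beta in (0.0, 0.5, 3.0):
    print(beta, likelihood_two_terms(1, 0.7, beta), likelihood_boltzmann(1, 0.7, beta))
```

Writing the likelihood as a Boltzmann factor is what allows the product over examples to become an exponential of a summed error count.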


The full conditional probability for $p$ independent examples is therefore

$$P(\sigma_1,\ldots,\sigma_p|\boldsymbol{B}) = \prod_{\mu=1}^{p} P(\sigma_\mu|\boldsymbol{B}) = (1+e^{-\beta})^{-p}\exp\Bigl\{-\beta\sum_{\mu=1}^{p}\Theta(-\sigma_\mu v_\mu)\Bigr\}. \tag{8.11}$$

The explicit expression of the posterior is, from (8.8),

$$P(\boldsymbol{J}|\sigma_1,\ldots,\sigma_p) = \frac{\exp\bigl\{-\beta\sum_{\mu}\Theta(-\sigma_\mu u_\mu)\bigr\}\,P(\boldsymbol{J})}{Z}, \tag{8.12}$$

which has been derived for a fixed set of inputs $\boldsymbol{\xi}_1,\ldots,\boldsymbol{\xi}_p$. Note that $v_\mu$ (=
$\boldsymbol{B}\cdot\boldsymbol{\xi}_\mu$) in the argument of the step function $\Theta$ in (8.11) has been replaced by
$u_\mu$ (= $\boldsymbol{J}\cdot\boldsymbol{\xi}_\mu$) in (8.12) because in the latter we consider $P(\sigma_1,\ldots,\sigma_p|\boldsymbol{J})$ instead
of $P(\sigma_1,\ldots,\sigma_p|\boldsymbol{B})$ in the former. We shall assume a uniform prior on the sphere,
$P(\boldsymbol{J}) \propto \delta(\boldsymbol{J}^2-N)$ and $P(\boldsymbol{B}) \propto \delta(\boldsymbol{B}^2-N)$ as mentioned in §8.1.2. The expression
(8.12) has the form of a Boltzmann factor with the energy

$$E = \sum_{\mu=1}^{p}\Theta(-\sigma_\mu u_\mu). \tag{8.13}$$

This is the same energy as we introduced before, in (8.5). For large $\beta$ the system
favours students with smaller training errors. Equation (8.12) motivates us to
apply statistical mechanics to the theory of learning.
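
To make the Boltzmann-factor reading concrete, here is a minimal sketch that counts the training errors of a candidate student and converts the count into an unnormalized posterior weight. It assumes a uniform prior and treats an example with $\sigma u = 0$ as correct; the toy data and function names are illustrative, not taken from the text.

```python
import math

def training_energy(J, examples):
    """Number of examples (xi, sigma) misclassified by the student couplings J."""
    errors = 0
    for xi, sigma in examples:
        u = sum(Jj * xj for Jj, xj in zip(J, xi))
        if sigma * u < 0:          # Theta(-sigma * u) = 1 only for a wrong sign
            errors += 1
    return errors

def boltzmann_weight(J, examples, beta):
    """Unnormalized posterior weight exp(-beta * E) for a uniform prior."""
    return math.exp(-beta * training_energy(J, examples))

examples = [((1.0, 0.0), 1), ((0.0, 1.0), -1), ((1.0, 1.0), 1)]
print(training_energy((1.0, -0.5), examples))    # student consistent with all examples
print(training_energy((-1.0, 1.0), examples))    # student that violates two examples
print(boltzmann_weight((-1.0, 1.0), examples, beta=2.0))
```

As $\beta$ grows, the weight of any student with a non-zero error count is suppressed exponentially, which is the sense in which large $\beta$ favours small training error.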

A few comments are in order. First, input noise, in contrast with output
noise, stands for random deviations of $\boldsymbol{\xi}$ to the student from those to the teacher
as well as deviations of the teacher couplings from the true value. Input noise
causes various non-trivial complications, which we do not discuss here (Györgyi
and Tishby 1990). Second, training error is defined as the average of the training
energy per example $E/p$. The average is taken first over the posterior (8.12)
and then over the configuration of input $\boldsymbol{\xi}$. Training error measures the average
error for examples already given to the student whereas generalization error is
the expected error for a new example. We focus our attention on generalization
error.


8.2.2 Learning algorithms 


It is instructive to list here several popular learning algorithms and their
properties. The Bayesian algorithm (Opper and Haussler 1991) to predict the output
$\sigma$ for a new input $\boldsymbol{\xi}$ compares the probability of $\sigma = 1$ with that of $\sigma = -1$ and
chooses the larger one, in analogy with (5.15) for error-correcting codes,

$$\sigma = \mathrm{sgn}(V^{+} - V^{-}), \tag{8.14}$$

where $V^{\pm}$ is the probability (or the volume of the relevant phase space) of the
output 1 or $-1$, respectively,

$$V^{\pm} = \int d\boldsymbol{J}\,\Theta(\pm\boldsymbol{J}\cdot\boldsymbol{\xi})\,P(\boldsymbol{J}|\sigma_1,\ldots,\sigma_p). \tag{8.15}$$

This is the best possible (Bayes-optimal) strategy to minimize the generalization
error, analogously to the case of error-correcting codes. In fact the generalization
error in the limit of large $p$ and $N$ with $\alpha = p/N$ fixed has been evaluated for
the Bayesian algorithm as (Opper and Haussler 1991)

$$\epsilon_g \approx \frac{0.44}{\alpha} \tag{8.16}$$

for sufficiently large $\alpha$, which is the smallest among all learning algorithms (some
of which will be explained below). A drawback is that a single student is unable
to follow the Bayesian algorithm because one has to explore the whole coupling
space to evaluate $V^{\pm}$, which is impossible for a single student with a single set of
couplings $\boldsymbol{J}$. Several methods to circumvent this difficulty have been proposed,
one of which is to form a layered network with a number of simple perceptrons
in the intermediate layer. All of these perceptrons have couplings generated by
the posterior (8.12) and receive a common input. The number of perceptrons in
the layer with $\sigma = 1$ is proportional to $V^{+}$ and that with $\sigma = -1$ to $V^{-}$. The
final output is decided by the majority rule of the outputs of these perceptrons
(committee machine).
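
The committee construction can be sketched in a few lines. The example below is a toy illustration, not the method of the cited papers: it assumes the noise-free limit, draws students by simple rejection sampling (practical only for very small dimension and few examples), and uses hypothetical function names. A single student drawn from the same set corresponds to the Gibbs algorithm discussed next.

```python
import random

def sgn(x):
    return 1 if x >= 0 else -1

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sample_version_space(examples, n_samples, dim, rng):
    """Rejection sampling of directions J that reproduce every training example."""
    kept = []
    while len(kept) < n_samples:
        J = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if all(sigma * dot(J, xi) > 0 for xi, sigma in examples):
            kept.append(J)
    return kept

def bayes_predict(students, xi):
    """Majority vote: sign of (number voting +1) minus (number voting -1)."""
    return sgn(sum(sgn(dot(J, xi)) for J in students))

def gibbs_predict(students, xi):
    """Prediction of one student drawn from the posterior."""
    return sgn(dot(students[0], xi))

rng = random.Random(0)
teacher = [1.0, 0.0, 0.0]
inputs = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(3)]
examples = [(x, sgn(dot(teacher, x))) for x in inputs]
committee = sample_version_space(examples, 11, 3, rng)   # odd size avoids ties
new_input = [0.5, -0.2, 0.1]
print(bayes_predict(committee, new_input), gibbs_predict(committee, new_input))
```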

An alternative learning strategy is the Gibbs algorithm in which one chooses
a single coupling vector $\boldsymbol{J}$ following the posterior (8.12). This gives a typical
realization of $\boldsymbol{J}$ and can be implemented by a single student perceptron. The
performance is slightly worse than the Bayes result. The asymptotic form of the
generalization error is, for $T \to 0$,

$$\epsilon_g \approx \frac{0.625}{\alpha} \tag{8.17}$$

in the regime $\alpha = p/N \to \infty$ (Györgyi and Tishby 1990). We shall present the
derivation of this result (8.17) later since the calculations are similar to those in
the previous chapter and include typical techniques used in many other cases.
The noise-free ($T = 0$) Gibbs learning is sometimes called the minimum error
algorithm.

In the posterior (8.12) we may consider more general forms of energy than the
simple error-counting function (8.13). Let us assume that the energy is additive
in the number of examples and write

$$E = \sum_{\mu=1}^{p} V(\sigma_\mu u_\mu). \tag{8.18}$$


The Hebb algorithm has $V(x) = -x$; the minimum of this energy with respect to
$\boldsymbol{J}$ has $\boldsymbol{J} \propto \sum_\mu \sigma_\mu\boldsymbol{\xi}_\mu$, reminiscent of the Hebb rule (7.4) for associative memory,
as can be verified by minimization of $E$ under the constraint $\boldsymbol{J}^2 = N$ using a
Lagrange multiplier. This Hebb algorithm is simple but not necessarily efficient
in the sense that the generalization error falls rather slowly (Vallet 1989):

$$\epsilon_g \approx \frac{1}{\sqrt{2\pi\alpha}} \approx \frac{0.40}{\sqrt{\alpha}}. \tag{8.19}$$

The parameter $x = \sigma\boldsymbol{J}\cdot\boldsymbol{\xi} = \sigma u$ may be interpreted as the level of confidence
of the student about its output: positive $x$ means a correct answer, and as $x$
increases, small fluctuations in $\boldsymbol{J}$ or $\boldsymbol{\xi}$ become less likely to lead to a wrong
answer. In the space of couplings, the subspace with $x > 0$ for all examples is
called the version space. It seems reasonable to choose a coupling vector $\boldsymbol{J}$ that
gives the largest possible $x$ in the version space. More precisely, we first choose
those $\boldsymbol{J}$ that have the stability parameter $x$ larger than a given threshold $\kappa$.
Then we increase the value of $\kappa$ until such $\boldsymbol{J}$ cease to exist. The vector $\boldsymbol{J}$ chosen
at this border of existence (vanishing version space) is the one with the largest
stability. This method is called the maximum stability algorithm. The maximum
stability algorithm can be formulated in terms of the energy

$$V(x) = \begin{cases} \infty & x < \kappa \\ 0 & x \ge \kappa. \end{cases} \tag{8.20}$$

One increases $\kappa$ until the volume of subspace under the energy (8.20) vanishes in
the $\boldsymbol{J}$-space. The asymptotic form of the generalization error for the maximum
stability algorithm has been evaluated as (Opper et al. 1990)

$$\epsilon_g \approx \frac{0.50}{\alpha}, \tag{8.21}$$

which lies between the Bayes result (8.16) and the minimum error (Gibbs)
algorithm (8.17). The inverse-$\alpha$ law of the generalization error has also been derived
from arguments based on mathematical statistics (Haussler et al. 1991).
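
The batch Hebb couplings and the stability parameter are simple to compute for a small data set. The sketch below is illustrative only: it builds $\boldsymbol{J} \propto \sum_\mu \sigma_\mu\boldsymbol{\xi}_\mu$ rescaled to $\boldsymbol{J}^2 = N$ and reports the smallest value of $\sigma\boldsymbol{J}\cdot\boldsymbol{\xi}$ over the examples; the toy data and the function names are assumptions made here.

```python
import math

def hebb_couplings(examples):
    """J proportional to sum_mu sigma_mu * xi_mu, rescaled so that J^2 = N."""
    N = len(examples[0][0])
    J = [0.0] * N
    for xi, sigma in examples:
        for j in range(N):
            J[j] += sigma * xi[j]
    norm = math.sqrt(sum(x * x for x in J))   # assumed non-zero for this sketch
    return [math.sqrt(N) * x / norm for x in J]

def min_stability(J, examples):
    """Smallest sigma * J.xi over the examples; positive means J is in the version space."""
    return min(sigma * sum(a * b for a, b in zip(J, xi)) for xi, sigma in examples)

examples = [((1.0, 0.0), 1), ((0.0, 1.0), 1), ((-1.0, 0.5), -1)]
J = hebb_couplings(examples)
print(J, min_stability(J, examples))
```

The maximum stability algorithm would instead search for the $\boldsymbol{J}$ that maximizes the quantity returned by `min_stability`; the Hebb vector is in general not that maximizer and need not even lie in the version space.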


8.2.3 High-temperature and annealed approximations 
The problem simplifies significantly in the high-temperature limit, and yet
non-trivial behaviour results. In particular, we show that the generalization error can
be calculated relatively easily using the example of the Gibbs algorithm (Seung
et al. 1992).

The partition function in the high-temperature limit ($\beta \to 0$) is

$$Z = \int d\boldsymbol{J}\,P(\boldsymbol{J})\Bigl(1-\beta\sum_{\mu}\Theta(-\sigma_\mu u_\mu)\Bigr) + O(\beta^2) = \int dR\,Z(R) + O(\beta^2) \tag{8.22}$$

according to (8.12). The partition function with $R$ specified is rewritten as

$$Z(R) = V_R\Bigl(1-\beta\sum_{\mu}\frac{\int_R d\boldsymbol{J}\,P(\boldsymbol{J})\,\Theta(-\sigma_\mu u_\mu)}{V_R}\Bigr), \qquad V_R = \int_R d\boldsymbol{J}\,P(\boldsymbol{J}). \tag{8.23}$$

The integral with suffix $R$ runs over the subspace with a fixed value of the overlap
$R$. The integral over $R$ in (8.22) is dominated by the value of $R$ that maximizes
$Z(R)$ or minimizes the corresponding free energy. The configurational average of
the free energy for fixed $R$ is

$$\beta F(R) = -[\log Z(R)] = -\log V_R + \beta p E(R), \tag{8.24}$$

where the relation (8.6) has been used. The volume $V_R$ is evaluated using the
angle $\theta$ between $\boldsymbol{B}$ and $\boldsymbol{J}$ as

$$V_R \propto \sin^{N-2}\theta = (1-R^2)^{(N-2)/2}. \tag{8.25}$$

The reason is that the student couplings $\boldsymbol{J}$ with a fixed angle $\theta$ to $\boldsymbol{B}$ lie on a
hypersphere of radius $\sqrt{N}\sin\theta$ and dimensionality $N-2$ in the coupling space.
Suppose that Fig. 8.2 is for $N = 3$; that is, $\boldsymbol{J}$ and $\boldsymbol{B}$ both lie on a sphere of radius
$\sqrt{N}$ (= $\sqrt{3}$). Then the $\boldsymbol{J}$ with a fixed angle $\theta$ to $\boldsymbol{B}$ all lie on a circle (dimensionality
$N-2 = 1$) with radius $\sqrt{N}\sin\theta$ on the surface of the sphere.

We have to minimize the free energy $F(R)$ in the large-$N$ limit. From the
explicit expression of the free energy per degree of freedom

$$\beta f(R) = \frac{\tilde{\alpha}}{\pi}\cos^{-1}R - \frac{1}{2}\log(1-R^2), \tag{8.26}$$

where $\tilde{\alpha} = \beta\alpha$, we find

$$\frac{R}{\sqrt{1-R^2}} = \frac{\tilde{\alpha}}{\pi}. \tag{8.27}$$

The generalization error $\epsilon_g = E(R)$ is thus

$$\epsilon_g = \frac{1}{\pi}\cos^{-1}\frac{\tilde{\alpha}}{\sqrt{\pi^2+\tilde{\alpha}^2}} \approx \frac{1}{\tilde{\alpha}}, \tag{8.28}$$

where the last expression is valid in the large-$\tilde{\alpha}$ limit. The combination $\tilde{\alpha} = \beta\alpha$
is the natural parameter in the high-temperature limit (small $\beta$) because one
should present more and more examples (large $\alpha$) to overwhelm the effects of
noise. The result (8.28) shows that the inverse-$\alpha$ law holds for the learning curve
in the high-temperature limit.
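
The high-temperature learning curve is explicit enough to tabulate directly. The following minimal sketch assumes the free energy per degree of freedom $\beta f(R) = (\tilde{\alpha}/\pi)\cos^{-1}R - \frac{1}{2}\log(1-R^2)$ with $\tilde{\alpha} = \beta\alpha$, solves its stationarity condition for $R$ in closed form, and compares the resulting error with $1/\tilde{\alpha}$. Function names are illustrative.

```python
import math

def free_energy(R, alpha_tilde):
    """beta * f(R) in the high-temperature limit."""
    return alpha_tilde / math.pi * math.acos(R) - 0.5 * math.log(1.0 - R * R)

def overlap_high_temperature(alpha_tilde):
    """Solution of R / sqrt(1 - R^2) = alpha_tilde / pi."""
    return alpha_tilde / math.sqrt(math.pi ** 2 + alpha_tilde ** 2)

def eps_g_high_temperature(alpha_tilde):
    """Generalization error (1/pi) * arccos(R) at the minimizing overlap."""
    return math.acos(overlap_high_temperature(alpha_tilde)) / math.pi

for a in (1.0, 10.0, 100.0):
    print(a, eps_g_high_temperature(a), 1.0 / a)
```

The printed pairs approach each other as $\tilde{\alpha}$ grows, which is the inverse-$\alpha$ law in this limit.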

Annealed approximation is another useful technique to simplify calculations 
(Seung et al. 1992). One performs the configurational average of Z, instead of 
log Z, in the annealed approximation. This approximation can be implemented 
relatively easily in many problems for which the full quenched analysis is difficult. 
One should, however, appreciate that this is an uncontrolled approximation and 
the result is sometimes qualitatively unreliable. 


8.2.4 Gibbs algorithm 


We now give an example of derivation of the learning curve for the case of the 
noise-free Gibbs algorithm (minimum error algorithm) in the limit of large N 
and p with the ratio a = p/N fixed (Györgyi and Tishby 1990). The techniques 
used here are analogous to those in the previous chapter and are useful in many 
other cases of batch learning. 

According to (8.6), the generalization error $\epsilon_g(\alpha)$ is determined by the relation
between $R$ and $\alpha$. It is useful for this purpose to calculate the volume of the space
of student couplings $\boldsymbol{J}$ under the posterior (8.12):

$$V = \frac{1}{Z}\int d\boldsymbol{J}\,\delta\Bigl(\sum_j J_j^2 - N\Bigr)e^{-\beta E} \tag{8.29}$$

with $E = \sum_\mu \Theta(-\sigma_\mu u_\mu)$. In the noise-free Gibbs algorithm the student chooses
$\boldsymbol{J}$ with a uniform probability in the version space satisfying $E = 0$.
Correspondingly, the limit $\beta \to \infty$ will be taken afterwards.

We investigate the typical macroscopic behaviour of the system by taking the
configurational average over the input vectors, which are regarded as quenched
random variables. It is particularly instructive to derive the average of $\log V$,
corresponding to the entropy (logarithm of the volume of the relevant space).
We use the replica method and express the average of $V^n$ in terms of the order
parameter $q_{\alpha\beta}$. The normalization $Z$ is omitted since it does not play a positive
role. The following equation results if we note that $\Theta$ is either 0 or 1:

$$[V^n] = \int\prod_{\alpha}dR_\alpha\int\prod_{(\alpha\beta)}dq_{\alpha\beta}\int\prod_{\alpha,j}dJ_j^\alpha\;\prod_{\alpha}\delta\Bigl(\sum_j (J_j^\alpha)^2 - N\Bigr)\prod_{\alpha}\delta\Bigl(\sum_j B_jJ_j^\alpha - NR_\alpha\Bigr)\prod_{(\alpha\beta)}\delta\Bigl(\sum_j J_j^\alpha J_j^\beta - Nq_{\alpha\beta}\Bigr)\Bigl[\prod_{\alpha,\mu}\bigl\{e^{-\beta}+(1-e^{-\beta})\Theta(u_\mu^\alpha v_\mu)\bigr\}\Bigr]. \tag{8.30}$$

8.2.5 Replica calculations 


Evaluation of (8.30) proceeds by separating it into two parts: the first half
independent of the input (the part including the delta functions), and the second half
including the effects of input signals (in the brackets $[\cdots]$). The former is denoted
by $I_1^N$, and we write the three delta functions in terms of Fourier transforms to
obtain, under the RS ansatz,

$$I_1^N = \int\prod_{\alpha,j}dJ_j^\alpha\,\exp\Bigl\{i\Bigl(E\sum_{\alpha,j}(J_j^\alpha)^2 + F\sum_{(\alpha\beta),j}J_j^\alpha J_j^\beta + G\sum_{\alpha,j}J_j^\alpha B_j\Bigr) - iN\Bigl(nE + \frac{n(n-1)}{2}qF + nRG\Bigr)\Bigr\}, \tag{8.31}$$

where the integrals over the parameters $R$, $q$, $E$, $F$, and $G$ have been omitted in
anticipation of the use of the method of steepest descent.

It is possible to carry out the above integral for each $j$ independently. The
difference between this and (7.83) is only in $G$, and the basic idea of calculation is
the same. One diagonalizes the quadratic form and performs Gaussian integration
independently for each eigenmode.$^{18}$ It is to be noted that, for a given $j$, the term
involving $G$ has the form $GB_j\sum_\alpha J_j^\alpha$ (uniform sum over $\alpha$) and correspondingly
the uniform mode in the diagonalized form (with the eigenvalue $E + (n-1)F/2$)
has a linear term. In other words, for a given $j$, the contribution of the uniform
mode $u$ (= $\sum_\alpha J_j^\alpha$) to the quadratic form is

$$\frac{i}{n}\Bigl(E + \frac{n-1}{2}F\Bigr)u^2 + iGB_ju. \tag{8.32}$$

18 One may instead decompose the term involving the double sum $(\alpha\beta)$ to a linear form by
Gaussian integration.

By integrating over $u$ and summing the result over $j$, we obtain $-inG^2N/[4\{E +
(n-1)F/2\}]$ in the exponent. Thus in the limit $n \to 0$

$$g_1(E,F,G) \equiv \frac{1}{nN}\log I_1^N = -\frac{1}{2}\log\Bigl(E-\frac{F}{2}\Bigr) - \frac{F}{4E-2F} - \frac{iG^2}{4E-2F} - iE + \frac{i}{2}qF - iGR. \tag{8.33}$$

The extremization condition of $g_1(E,F,G)$ with respect to $E$, $F$, and $G$ yields

$$2E - F = \frac{i}{1-q}, \qquad \frac{F+iG^2}{2E-F} = -\frac{q}{1-q}, \qquad iG = \frac{R}{1-q}. \tag{8.34}$$

We can now eliminate $E$, $F$, and $G$ from (8.33) to obtain, up to a trivial constant,

$$g_1 = \frac{1}{2}\log(1-q) + \frac{1-R^2}{2(1-q)}. \tag{8.35}$$

Next, the second half $I_2^p$ of (8.30) is decomposed into a product over $\mu$, and
its single factor is

$$(I_2^p)^{1/p} = \Bigl[2\Theta(v)\prod_{\alpha}\bigl\{e^{-\beta}+(1-e^{-\beta})\Theta(u^\alpha)\bigr\}\Bigr]. \tag{8.36}$$

Here, since the contribution from the region $u > 0, v > 0$ is the same as that
from $u < 0, v < 0$, we have written only the former and have multiplied the
whole expression by 2. It is convenient to note here that $u$ and $v$ are Gaussian
variables with the following correlations:

$$[u^\alpha u^\beta] = (1-q)\delta_{\alpha,\beta} + q, \qquad [vu^\alpha] = R, \qquad [v^2] = 1. \tag{8.37}$$

These quantities can be expressed in terms of $n+2$ uncorrelated Gaussian
variables $t$ and $z^\alpha$ ($\alpha = 0,\ldots,n$), both of which have vanishing mean and variance
unity:

$$v = \sqrt{1-\frac{R^2}{q}}\,z^0 + \frac{R}{\sqrt{q}}\,t, \qquad u^\alpha = \sqrt{1-q}\,z^\alpha + \sqrt{q}\,t \quad (\alpha = 1,\ldots,n). \tag{8.38}$$

Equation (8.36) is rewritten using these variables as

$$(I_2^p)^{1/p} = 2\int Dt\int Dz^0\,\Theta\Bigl(\sqrt{1-\frac{R^2}{q}}\,z^0+\frac{R}{\sqrt{q}}\,t\Bigr)\prod_{\alpha=1}^{n}\int Dz^\alpha\bigl\{e^{-\beta}+(1-e^{-\beta})\Theta(\sqrt{1-q}\,z^\alpha+\sqrt{q}\,t)\bigr\}$$
$$= 2\int Dt\int_{-Rt/\sqrt{q-R^2}}^{\infty}Dz^0\Bigl[\int Dz\bigl\{e^{-\beta}+(1-e^{-\beta})\Theta(\sqrt{1-q}\,z+\sqrt{q}\,t)\bigr\}\Bigr]^n. \tag{8.39}$$

We collect the two factors (8.35) and (8.39) and take the limit $n \to 0$ to find

$$f \equiv \lim_{n\to 0}\frac{1}{nN}\log[V^n] = 2\alpha\int Dt\int_{-Rt/\sqrt{q-R^2}}^{\infty}Dz^0\,\log\Bigl\{e^{-\beta}+(1-e^{-\beta})\int_{-\sqrt{q/(1-q)}\,t}^{\infty}Dz\Bigr\} + \frac{1}{2}\log(1-q) + \frac{1-R^2}{2(1-q)}. \tag{8.40}$$

The extremization condition of (8.40) determines $R$ and $q$.


8.2.6 Generalization error at T =0 

It is possible to carry out the calculations for finite temperatures (Györgyi and 
Tishby 1990). However, formulae simplify significantly for the noise-free case of 
T = 0, which also serves as a prototype of the statistical-mechanical evaluation of 
the learning curve. We therefore continue the argument by restricting ourselves 
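to $T = 0$. Let us then take the limit $\beta \to \infty$ in (8.40) and change the variables
as

$$u = \frac{R}{\sqrt{q}}\,z^0 - \sqrt{\frac{q-R^2}{q}}\,t, \qquad v = \sqrt{\frac{q-R^2}{q}}\,z^0 + \frac{R}{\sqrt{q}}\,t \tag{8.41}$$

to obtain

$$f = 2\alpha\int_0^{\infty}Dv\int_{-\infty}^{\infty}Du\,\log\int_w^{\infty}Dz + \frac{1}{2}\log(1-q) + \frac{1-R^2}{2(1-q)}, \tag{8.42}$$

$$w = \frac{\sqrt{q-R^2}\,u - Rv}{\sqrt{1-q}}. \tag{8.43}$$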
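The extremization condition of (8.42) with respect to $R$ and $q$ leads to the
following equations:

$$\alpha\sqrt{\frac{2}{\pi}}\int_0^{\infty}Dv\int_{-\infty}^{\infty}Du\,\frac{u\,e^{-w^2/2}}{\int_w^{\infty}Dz} = \frac{q-2R^2}{1-2q}\sqrt{\frac{q-R^2}{1-q}}, \tag{8.44}$$

$$\alpha\sqrt{\frac{2}{\pi}}\int_0^{\infty}Dv\int_{-\infty}^{\infty}Du\,\frac{v\,e^{-w^2/2}}{\int_w^{\infty}Dz} = \frac{1-3q+2R^2}{1-2q}\,\frac{R}{\sqrt{1-q}}. \tag{8.45}$$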

The solution of these equations satisfies the relation $q = R$. Indeed, by setting
$q = R$ and rewriting the above two equations, we obtain the same formula

$$\frac{\alpha}{\pi}\int_{-\infty}^{\infty}Dx\,\frac{e^{-qx^2/2}}{\int_{\sqrt{q}x}^{\infty}Dz} = \frac{q}{\sqrt{1-q}}. \tag{8.46}$$

We have changed the variables $u = x + \sqrt{q/(1-q)}\,y$, $v = y$ in deriving this
equation. The relation $q = R$ implies that the student and teacher are in a
certain sense equivalent because of the definitions $q_{\alpha\beta} = N^{-1}\sum_j J_j^\alpha J_j^\beta$ and $R =
N^{-1}\sum_j B_jJ_j^\alpha$.

The behaviour of the learning curve in the limit of a large number of examples
($\alpha \to \infty$) can be derived by setting $q = 1-\epsilon$ ($|\epsilon| \ll 1$) since $\boldsymbol{J}$ is expected to be
close to $\boldsymbol{B}$ in this limit. Then (8.46) reduces to

$$\epsilon \approx \frac{\pi^2}{c^2\alpha^2}, \qquad c = \int_{-\infty}^{\infty}Dx\,\frac{e^{-x^2/2}}{\int_x^{\infty}Dz}. \tag{8.47}$$

Substitution of this into (8.6) gives the asymptotic form of the learning curve as

$$\epsilon_g \approx \frac{\sqrt{2}}{c\alpha} = \frac{0.625}{\alpha}, \tag{8.48}$$

as already mentioned in (8.17). This formula shows that the generalization error
of a simple perceptron decreases in inverse proportion to the number of examples
in batch learning. It is also known that the present RS solution is stable (Györgyi
and Tishby 1990).
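
The numerical prefactor of the learning curve can be checked by direct quadrature. The sketch below assumes the constant $c = \int Dx\, e^{-x^2/2}/\int_x^\infty Dz$ with the Gaussian measure $Dx = e^{-x^2/2}dx/\sqrt{2\pi}$, and evaluates it with a composite Simpson rule; the integration window and step count are choices made here, not part of the original derivation.

```python
import math

def H(x):
    """Gaussian tail integral: int_x^inf Dz."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def integrand(x):
    """Gaussian weight times exp(-x^2/2) / H(x)."""
    return math.exp(-x * x) / (math.sqrt(2.0 * math.pi) * H(x))

def simpson(f, a, b, n):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

c = simpson(integrand, -10.0, 12.0, 4400)
print(c, math.sqrt(2.0) / c)   # second number is the coefficient of 1/alpha
```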


8.2.7 Noise and unlearnable rules 


We have discussed the case where the teacher and student are both simple per- 
ceptrons with continuous weights and their binary outputs coincide if the student 
coupling agrees with the teacher coupling. There are many other possible scenar- 
ios of learning. For example, the output of the teacher may follow a stochastic 
process (output noise) as mentioned before, or the input signal may include noise 
(input noise). In both of these cases, the student cannot reproduce the teacher 
output perfectly even when the couplings agree. In the simple case we have been 
treating, output noise deteriorates the generalization error just by a factor mildly 
dependent on the noise strength (temperature) (Opper and Haussler 1991; Op- 
per and Kinzel 1995). Input noise causes more complications like RSB (Györgyi 
and Tishby 1990). More complex structures of the unit processing elements than 
a simple perceptron lead to such effects even for output noise (Watkin and Rau 
1993). 

Another important possibility we have not discussed so far is that the
structure of the student is different from that of the teacher, so that the student cannot
reproduce the teacher output in principle (unlearnable or unrealizable rules). If
the teacher is composed of layers of perceptrons for instance, the teacher may be
able to perform linearly non-separable rules. A simple perceptron as the student
cannot follow such teacher outputs faithfully. Or, if the threshold $\theta$ appearing in
the output sgn($u - \theta$) is different between the teacher and student, the student
cannot follow the teacher output even if $\boldsymbol{J}$ coincides with $\boldsymbol{B}$. A more naive
instance is the weight mismatch where the components of the student vector $\boldsymbol{J}$ may
take only discrete values whereas the teacher vector $\boldsymbol{B}$ is continuous. These and
more examples of unlearnable rules have been investigated extensively (Seung et
al. 1992; Watkin and Rau 1992, 1993; Domany et al. 1995; Wong et al. 1997).
A general observation is that RSB and spin glass states emerge in certain
regions of the phase diagram in these unlearnable cases. Non-monotonic behaviour
of the generalization error as a function of temperature is also an interesting
phenomenon in some of these instances.


8.3 On-line learning 

The next topic is on-line learning in which the student changes its couplings im- 
mediately after the teacher output is given for an input. In practical applications, 
on-line learning is often more important than batch learning because the former 
requires less memory and time to implement and, furthermore, adjustment to 
changing environments is easier. Similar to the batch case, we mainly investigate 
the learning curve for simple perceptrons under various learning algorithms. The 
formula for the generalization error (8.6) remains valid for on-line learning as
well.


8.3.1 Learning algorithms 

Suppose that both student and teacher are simple perceptrons, and the student
couplings change according to the perceptron learning algorithm (7.72) as soon
as an input is given. Expressing the student coupling vector after $m$ steps of
learning by $\boldsymbol{J}^m$, we can write the on-line perceptron learning algorithm as

$$\boldsymbol{J}^{m+1} = \boldsymbol{J}^m + \Theta(-\mathrm{sgn}(u)\,\mathrm{sgn}(v))\,\mathrm{sgn}(v)\,\boldsymbol{x} = \boldsymbol{J}^m + \Theta(-uv)\,\mathrm{sgn}(v)\,\boldsymbol{x}, \tag{8.49}$$

where $u = \sqrt{N}\boldsymbol{J}^m\cdot\boldsymbol{x}/|\boldsymbol{J}^m|$ and $v = \sqrt{N}\boldsymbol{B}\cdot\boldsymbol{x}/|\boldsymbol{B}|$ are input signals to the
student and teacher, respectively, and $\boldsymbol{x}$ is the normalized input vector, $|\boldsymbol{x}| = 1$.
The $\sigma^\mu$ in (7.72) corresponds to sgn($v$) and $\eta\boldsymbol{\xi}$ to $\boldsymbol{x}$. We normalize $\boldsymbol{x}$ to unity
since, in contrast to §7.6.2, the dynamics of learning is investigated here in the
limit of large $N$ and we should take care that various quantities do not diverge.
The components of the input vector $\boldsymbol{x}$ are assumed to have no correlations with
each other, and an independent $\boldsymbol{x}$ is drawn at each learning step; in other words,
infinitely many independent examples are available. This assumption may not
be practically acceptable, especially when the number of examples is limited.
The effects of removing this condition have been discussed as the problem of the
restricted training set (Sollich and Barber 1997; Barber and Sollich 1998; Heskes
and Wiegerinck 1998; Coolen and Saad 2000).
Other typical learning algorithms include the Hebb algorithm

$$\boldsymbol{J}^{m+1} = \boldsymbol{J}^m + \mathrm{sgn}(v)\,\boldsymbol{x} \tag{8.50}$$

and the Adatron algorithm

$$\boldsymbol{J}^{m+1} = \boldsymbol{J}^m - u\,\Theta(-uv)\,\boldsymbol{x}. \tag{8.51}$$
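
The three on-line update rules can be written as one step function with a pluggable correction term. The sketch below is a minimal illustration under the normalization used in the text ($|\boldsymbol{x}| = 1$, signals scaled by $\sqrt{N}$); all function names are hypothetical. For convenience the correction functions receive $v$ itself, but each of them uses only the sign of $v$, as the student is supposed to.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def sgn(z):
    return 1.0 if z >= 0 else -1.0

def fields(J, B, x):
    """Input signals u (student) and v (teacher) for a unit-length input x."""
    N = len(x)
    u = math.sqrt(N) * dot(J, x) / math.sqrt(dot(J, J))
    v = math.sqrt(N) * dot(B, x) / math.sqrt(dot(B, B))
    return u, v

def f_perceptron(u, v):
    """Correct only when the student output is wrong."""
    return sgn(v) if u * v < 0 else 0.0

def f_hebb(u, v):
    """Always move towards the teacher output."""
    return sgn(v)

def f_adatron(u, v):
    """When wrong, correct by an amount proportional to |u|."""
    return -u if u * v < 0 else 0.0

def online_step(J, B, x, f):
    """One update J -> J + f(u, v) * x."""
    u, v = fields(J, B, x)
    step = f(u, v)
    return [Jj + step * xj for Jj, xj in zip(J, x)]

J, B, x = [1.0, 0.0], [0.0, 1.0], [-0.6, 0.8]
for rule in (f_perceptron, f_hebb, f_adatron):
    print(rule.__name__, online_step(J, B, x, rule))
```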


The Hebb algorithm changes the student couplings by sgn($v$)$\boldsymbol{x}$ irrespective of
the student output (correct or not) so that the inner product $\boldsymbol{J}\cdot\boldsymbol{x}$ tends to yield
the correct output sgn($v$) for the next input. The Adatron algorithm is a little
similar to the perceptron algorithm (8.49), the difference being that the amount
of correction for incorrect output is proportional to $u$.


8.3.2 Dynamics of learning


To develop a general argument, we write the learning algorithm in the form

$$\boldsymbol{J}^{m+1} = \boldsymbol{J}^m + f(\mathrm{sgn}(v), u)\,\boldsymbol{x}. \tag{8.52}$$


The choice of arguments of the function f, sgn(v) and u, shows that the student 
knows only the teacher output sgn(v), not the input signal v to the teacher, the 
latter involving the information on the teacher couplings. 

Some algorithms may be interpreted as the dynamics along the gradient of
an energy (or cost function) with respect to $\boldsymbol{J}$. For example, the Hebb algorithm
(8.50) can be expressed as

$$\boldsymbol{J}^{m+1} = \boldsymbol{J}^m - \eta_m\frac{\partial V(\sigma\boldsymbol{x}\cdot\boldsymbol{J}^m)}{\partial\boldsymbol{J}^m} \tag{8.53}$$

with $\sigma$ = sgn($v$), $\eta_m = 1$, and $V(y) = -y$. If we identify $V$ with the cost function
analogously to batch learning (§8.2.1), (8.53) represents a process to reduce $V$
by gradient descent with constant learning rate $\eta_m = 1$.$^{20}$ Such a viewpoint will
be useful later in §8.3.5 where adaptive change of $\eta_m$ is discussed. We use the
expression (8.52) here, which does not assume the existence of energy, to develop
a general argument for the simple perceptron.

The learning dynamics (8.52) determines each component $J_i$ of the coupling
vector $\boldsymbol{J}$ precisely. However, we are mainly interested in the macroscopic
properties of the system in the limit $N \gg 1$, such as the coupling length $l^m = |\boldsymbol{J}^m|/\sqrt{N}$
and the overlap $R^m = (\boldsymbol{J}^m\cdot\boldsymbol{B})/|\boldsymbol{J}^m||\boldsymbol{B}|$ to represent the teacher-student
proximity. We shall assume that $|\boldsymbol{B}|$ is normalized to $\sqrt{N}$. Our goal in the present
section is to derive the equations to describe the time development of these
macroscopic quantities $R$ and $l$ from the microscopic relation (8.52).

To this end, we first square both sides of (8.52) and write $u^m = (\boldsymbol{J}^m\cdot\boldsymbol{x})/l^m$
to obtain

$$N(l^{m+1})^2 = N(l^m)^2 + 2[fu]\,l^m + [f^2], \tag{8.54}$$

where $fu$ and $f^2$ have been replaced with their averages by the distribution
function (8.3) of $u$ and $v$ assuming the self-averaging property. By writing
$l^{m+1} - l^m = dl$ and $1/N = dt$, we can reduce (8.54) to a differential equation

$$\frac{dl}{dt} = [fu] + \frac{[f^2]}{2l}. \tag{8.55}$$

Here $t$ is the number of examples in units of $N$ and can be regarded as the time
of learning.

20 The error back propagation algorithm for multilayer networks, used quite frequently in
practical applications, is formulated as a gradient descent process with the training error as
the energy or cost function. Note that this algorithm is essentially of batch type, not on-line
(Hertz et al. 1991). A more sophisticated approach employs natural gradient in information
geometry (Amari 1997).

The equation for $R$ is derived by taking the inner product of both sides of
(8.52) and $\boldsymbol{B}$ and using the relations $\boldsymbol{B}\cdot\boldsymbol{J}^m = Nl^mR^m$ and $v = \boldsymbol{B}\cdot\boldsymbol{x}$. The
result is

$$\frac{dR}{dt} = \frac{[fv]-[fu]R}{l} - \frac{R}{2l^2}[f^2]. \tag{8.56}$$

The learning curve $\epsilon_g = E(R(t))$ is obtained by solving (8.55) and (8.56) and
substituting the solution $R(t)$ into (8.6).
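
The pair of macroscopic equations can be integrated numerically once the three averages $[fu]$, $[fv]$ and $[f^2]$ are known. The sketch below is a minimal illustration that assumes the forms $dl/dt = [fu] + [f^2]/2l$ and $dR/dt = ([fv]-[fu]R)/l - R[f^2]/2l^2$, uses explicit Euler steps, and takes the Hebb averages $[fu] = \sqrt{2/\pi}\,R$, $[fv] = \sqrt{2/\pi}$, $[f^2] = 1$ quoted in the next subsection as the test case. The initial values and step size are arbitrary choices.

```python
import math

def hebb_averages(R, l):
    """Averages [fu], [fv], [f^2] for f = sgn(v) with Gaussian (u, v) of overlap R."""
    a = math.sqrt(2.0 / math.pi)
    return a * R, a, 1.0

def integrate(averages, R0, l0, t_end, dt=1e-3):
    """Explicit Euler integration of the equations for l(t) and R(t)."""
    R, l = R0, l0
    for _ in range(int(round(t_end / dt))):
        fu, fv, f2 = averages(R, l)
        dl = fu + f2 / (2.0 * l)
        dR = (fv - fu * R) / l - R * f2 / (2.0 * l * l)
        l += dt * dl
        R += dt * dR
    return R, l

R, l = integrate(hebb_averages, R0=0.0, l0=1.0, t_end=20.0)
print(R, l, math.acos(R) / math.pi)   # overlap, length, generalization error
```

For the Hebb rule the equations can also be solved by hand, since $d(lR)/dt = [fv]$ and $d(l^2)/dt = 2l[fu] + [f^2]$ are linear in $t$; this gives a convenient check of the integrator.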


8.3.3 Generalization errors for specific algorithms 
We next present explicit solutions for various learning algorithms. The first one 
is the perceptron algorithm 


$$f = \Theta(-uv)\,\mathrm{sgn}(v). \tag{8.57}$$

The averages in the equations of learning (8.55) and (8.56) reduce to the following
expressions after the integrals are evaluated:

$$[fu] = -[fv] = \int du\,dv\,P(u,v)\,\Theta(-uv)\,u\,\mathrm{sgn}(v) = \frac{R-1}{\sqrt{2\pi}}, \tag{8.58}$$

$$[f^2] = \int du\,dv\,P(u,v)\,\Theta(-uv) = E(R) = \frac{1}{\pi}\cos^{-1}R. \tag{8.59}$$

Insertion of these equations into (8.55) and (8.56) will give the time dependence
of the macroscopic parameters $R(t)$ and $l(t)$, from which we obtain the learning
curve $\epsilon_g(t)$.

It is interesting to check the explicit asymptotic form of the learning curve
as $l$ grows and $R$ approaches unity. Setting $R = 1-\epsilon$ and $l = 1/\delta$ with $\epsilon, \delta \ll 1$,
we have from (8.55) and (8.56)

$$\frac{d\epsilon}{dt} = \frac{\sqrt{2\epsilon}}{2\pi}\,\delta^2 - \sqrt{\frac{2}{\pi}}\,\epsilon\delta, \qquad \frac{d\delta}{dt} = -\frac{\sqrt{2\epsilon}}{2\pi}\,\delta^3 + \frac{\epsilon}{\sqrt{2\pi}}\,\delta^2, \tag{8.60}$$

where use has been made of (8.58) and (8.59). The solution is

$$\epsilon = \Bigl(\frac{1}{3\sqrt{2}}\Bigr)^{2/3}t^{-2/3}, \qquad \delta = \frac{2\sqrt{\pi}}{(3\sqrt{2})^{1/3}}\,t^{-1/3}. \tag{8.61}$$

The final asymptotic form of the generalization error is therefore derived as
(Barkai et al. 1995)

$$\epsilon_g = E(R) \approx \frac{\sqrt{2}}{\pi(3\sqrt{2})^{1/3}}\,t^{-1/3} = 0.28\,t^{-1/3}. \tag{8.62}$$

Comparison of this result with the learning curve of batch learning (8.48) for the
Gibbs algorithm, $\epsilon_g \propto \alpha^{-1}$, reveals that the latter converges much more rapidly
to the ideal state $\epsilon_g = 0$ if the numbers of examples in both cases are assumed to
be equal ($\alpha = t$). It should, however, be noted that the cost (time and memory)
to learn the given number of examples is much larger in the batch case than in
on-line learning.

A similar analysis applies to the Hebb algorithm 


$$f(\mathrm{sgn}(v), u) = \mathrm{sgn}(v). \tag{8.63}$$

The averages in (8.55) and (8.56) are then

$$[fu] = \frac{2R}{\sqrt{2\pi}}, \qquad [f^2] = 1, \qquad [fv] = \sqrt{\frac{2}{\pi}}. \tag{8.64}$$

The asymptotic solution of the dynamical equations is derived by setting $R =
1-\epsilon$ and $l = 1/\delta$ with $\epsilon, \delta \ll 1$:

$$\frac{d\delta}{dt} = -\sqrt{\frac{2}{\pi}}\,\delta^2, \qquad \frac{d\epsilon}{dt} = \frac{\delta^2}{2} - 2\sqrt{\frac{2}{\pi}}\,\epsilon\delta. \tag{8.65}$$

The solution is

$$\delta = \sqrt{\frac{\pi}{2}}\,\frac{1}{t}, \qquad \epsilon = \frac{\pi}{4t}, \tag{8.66}$$

and the corresponding generalization error is (Biehl et al. 1995)

$$\epsilon_g \approx \frac{1}{\sqrt{2\pi}}\,t^{-1/2} = 0.40\,t^{-1/2}. \tag{8.67}$$

It is seen that the learning curve due to the Hebb algorithm, $\epsilon_g \propto t^{-1/2}$,
approaches zero faster than in the case of the perceptron algorithm, $\epsilon_g \propto t^{-1/3}$. It
is also noticed that the learning curve (8.67) shows a relaxation of comparable
speed to the corresponding batch result (8.19) if we identify $t$ with $\alpha$.

The Adatron algorithm of on-line learning has 


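$$f(\mathrm{sgn}(v), u) = -u\,\Theta(-uv), \tag{8.68}$$

and integrals in the dynamical equations (8.55) and (8.56) are

$$[fu] = -\int_0^{\infty}Du\,u^2\,\mathrm{Erfc}\Bigl(\frac{Ru}{\sqrt{2(1-R^2)}}\Bigr), \qquad [fv] = \frac{(1-R^2)^{3/2}}{\pi} + R[fu], \tag{8.69}$$

where Erfc($x$) = $(2/\sqrt{\pi})\int_x^{\infty}e^{-t^2}dt$ is the complementary error function.
With (8.68), $[f^2]$ is equal to $-[fu]$. The asymptotic form for $\epsilon = 1-R \ll 1$ is,
writing $c = 8/(3\sqrt{2}\pi)$,

$$[fu] \approx -\frac{2(2\epsilon)^{3/2}}{\sqrt{\pi}}\int_0^{\infty}y^2dy\,\mathrm{Erfc}(y) = -c\,\epsilon^{3/2}, \qquad [fv] \approx \Bigl(-c+\frac{2\sqrt{2}}{\pi}\Bigr)\epsilon^{3/2}. \tag{8.70}$$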


To solve the dynamical equations, we note that (8.55) has the explicit form

$$\frac{dl}{dt} = \Bigl(-1+\frac{1}{2l}\Bigr)c\,\epsilon^{3/2}, \tag{8.71}$$

which implies that $l$ does not change if $l = 1/2$. It is thus convenient to restrict
ourselves to $l = 1/2$, and then we find that (8.56) is

$$\frac{d\epsilon}{dt} = 2\Bigl(c-\frac{2\sqrt{2}}{\pi}\Bigr)\epsilon^{3/2}. \tag{8.72}$$

This equation is solved as $\epsilon = 4/(kt)^2$, where $k = 4\sqrt{2}/\pi - 2c$. The generalization
error is therefore (Biehl and Riegler 1994)

$$\epsilon_g \approx \frac{2\sqrt{2}}{\pi k}\,\frac{1}{t} = \frac{3}{2t}. \tag{8.73}$$

It is remarkable that on-line Adatron learning leads to a very fast convergence
comparable to the batch case discussed in §8.2.2 after identification of $\alpha$ and $t$.
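
The prefactor of the Adatron learning curve follows from plain arithmetic on the constants quoted above. The sketch below simply evaluates them, assuming $c = 8/(3\sqrt{2}\pi)$, $k = 4\sqrt{2}/\pi - 2c$, $\epsilon = 4/(kt)^2$ and $\epsilon_g = \pi^{-1}\cos^{-1}(1-\epsilon)$; variable names are illustrative.

```python
import math

c = 8.0 / (3.0 * math.sqrt(2.0) * math.pi)        # coefficient in [fu] = -c * eps**1.5
k = 4.0 * math.sqrt(2.0) / math.pi - 2.0 * c      # decay constant in eps = 4 / (k t)^2
prefactor = 2.0 * math.sqrt(2.0) / (math.pi * k)  # eps_g ~ sqrt(2 eps) / pi = prefactor / t

def eps(t):
    """Asymptotic deviation 1 - R at time t."""
    return 4.0 / (k * t) ** 2

def eps_g(t):
    """Generalization error from the overlap R = 1 - eps(t)."""
    return math.acos(1.0 - eps(t)) / math.pi

print(c, k, prefactor)
print(eps_g(1000.0) * 1000.0)   # should be close to the prefactor
```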


8.3.4 Optimization of learning rate 


Let us go back to the dynamical equation of learning (8.52) with the learning
rate $\eta$ written explicitly:

$$\boldsymbol{J}^{m+1} = \boldsymbol{J}^m + \eta_m f(\mathrm{sgn}(v), u)\,\boldsymbol{x}. \tag{8.74}$$

The constant learning rate $\eta_m = 1$ as in (8.52) keeps each component of $\boldsymbol{J}$
fluctuating even after sufficient convergence to the desired result. It
seems desirable to change the coupling vector $\boldsymbol{J}$ rapidly (large $\eta$) at the initial
stage of learning and, in the later stage, adjust $\boldsymbol{J}$ carefully with small $\eta$. Such
an adjustment of learning rate would lead to an acceleration of convergence.
We discuss this topic in the present and next subsections. We first formulate the
problem as an optimization of the learning rate, taking the case of the perceptron
algorithm as an example (Inoue et al. 1997). Then a more general framework is
developed in the next subsection without explicitly specifying the algorithm.

The discrete learning dynamics (8.74) is rewritten in terms of differential
equations in the limit of large $N$ as in (8.55) and (8.56) with $f$ multiplied by $\eta$.
For the perceptron algorithm with simple perceptrons as the teacher and student,
the resulting equations are

$$\frac{dl}{dt} = \frac{\eta(R-1)}{\sqrt{2\pi}} + \frac{\eta^2\cos^{-1}R}{2\pi l}, \tag{8.75}$$

$$\frac{dR}{dt} = -\frac{\eta(R^2-1)}{\sqrt{2\pi}\,l} - \frac{\eta^2R\cos^{-1}R}{2\pi l^2}, \tag{8.76}$$

where we have used (8.58) and (8.59).
Since $\epsilon_g = E(R) = 0$ for $R = 1$, the best strategy to adjust $\eta$ is to accelerate
the increase of $R$ towards $R = 1$ by maximizing the right hand side of (8.76)
with respect to $\eta$ at each value of $R$. We thus differentiate the right hand side of
(8.76) by $\eta$ and set the result to zero to obtain

$$\eta = -\frac{\sqrt{2\pi}\,l\,(R^2-1)}{2R\cos^{-1}R}. \tag{8.77}$$

Then the dynamical equations (8.75) and (8.76) simplify to

$$\frac{dl}{dt} = -\frac{l\,(R-1)^3(R+1)}{4R^2\cos^{-1}R}, \tag{8.78}$$

$$\frac{dR}{dt} = \frac{(R^2-1)^2}{4R\cos^{-1}R}. \tag{8.79}$$

Taking the ratio of both sides of these equations, we can derive the exact relation
between $R$ and $l$ as

$$l = \frac{cR}{(R+1)^2}. \tag{8.80}$$

The constant $c$ is determined by the initial condition. This solution indicates
that $l$ approaches a constant as $R \to 1$.
The asymptotic solution of $R(t)$ for $R = 1-\epsilon$ ($|\epsilon| \ll 1$) can easily be derived
from (8.79) as $\epsilon \approx 8/t^2$. Therefore the generalization error is asymptotically

$$\epsilon_g = \frac{\cos^{-1}R}{\pi} \approx \frac{4}{\pi t}, \tag{8.81}$$

which is much faster than the solution with constant learning rate (8.62). The
learning rate decreases asymptotically as $\eta \propto 1/t$ as we expected before.

A weakness of this method is that the learning rate depends explicitly on
$R$, as seen in (8.77), which is unavailable to the student. This difficulty can be
avoided by using the asymptotic form $\eta \propto 1/t$ for the whole period of learning
although the optimization at intermediate values of $t$ may not be achieved. Another
problem is that the simple optimization of the learning rate does not necessarily 
lead to an improved convergence property in other learning algorithms such as 
the Hebb algorithm. Optimization of the learning algorithm itself (to change the 
functional form of f adaptively) is a powerful method to overcome this point 
(Kinouchi and Caticha 1992). 
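
The optimization of the learning rate is easy to verify numerically at any fixed $R$ and $l$. The sketch below assumes the perceptron form $dR/dt = \eta(1-R^2)/(\sqrt{2\pi}\,l) - \eta^2R\cos^{-1}R/(2\pi l^2)$, which is quadratic in $\eta$, and checks that the maximizing rate reproduces $dR/dt = (R^2-1)^2/(4R\cos^{-1}R)$. The sample values of $R$ and $l$ and the function names are arbitrary choices for illustration.

```python
import math

def dR_dt(eta, R, l):
    """Rate of change of the overlap for the perceptron rule with learning rate eta."""
    acosR = math.acos(R)
    return (eta * (1.0 - R * R) / (math.sqrt(2.0 * math.pi) * l)
            - eta ** 2 * R * acosR / (2.0 * math.pi * l * l))

def eta_opt(R, l):
    """Learning rate that maximizes dR/dt at given R (0 < R < 1) and l."""
    return math.sqrt(2.0 * math.pi) * l * (1.0 - R * R) / (2.0 * R * math.acos(R))

R, l = 0.8, 1.3
best = eta_opt(R, l)
print(best, dR_dt(best, R, l), (R * R - 1.0) ** 2 / (4.0 * R * math.acos(R)))
```

Note that `eta_opt` needs the overlap $R$ as an input, which is exactly the quantity the student does not have access to.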


8.3.5 Adaptive learning rate for smooth cost function 


The idea of adaptive learning rate can be developed for a very general class of
learning algorithms including both learnable and unlearnable rules (Müller et
al. 1998). Let us assume that the cost function for a single input $V(\boldsymbol{x},\sigma;\boldsymbol{J})$
is differentiable by $\boldsymbol{J}$.$^{21}$ The output may be vector-valued $\boldsymbol{\sigma}$. The goal is to
minimize the total energy

$$E(\boldsymbol{J}) = [V(\boldsymbol{x},\sigma;\boldsymbol{J})], \tag{8.82}$$

21 Note that this condition excludes the perceptron and Adatron algorithms.

which is identified with the generalization error $\epsilon_g$. The energy is assumed to be
differentiable around the optimal state $\boldsymbol{J} = \boldsymbol{B}$:

$$E(\boldsymbol{J}) = E(\boldsymbol{B}) + \frac{1}{2}\,{}^t(\boldsymbol{J}-\boldsymbol{B})K(\boldsymbol{B})(\boldsymbol{J}-\boldsymbol{B}). \tag{8.83}$$

Here $K(\boldsymbol{B})$ is the value of the second-derivative matrix (Hessian) of the energy
$E(\boldsymbol{J})$ at $\boldsymbol{J} = \boldsymbol{B}$. The on-line dynamics to be discussed here is specified by

$$\boldsymbol{J}^{m+1} = \boldsymbol{J}^m - \eta_mK^{-1}(\boldsymbol{J}^m)\frac{\partial V(\boldsymbol{x},\sigma;\boldsymbol{J}^m)}{\partial\boldsymbol{J}^m}, \tag{8.84}$$

$$\eta_{m+1} = \eta_m + a\eta_m\bigl\{b\bigl(V(\boldsymbol{x},\sigma;\boldsymbol{J}^m) - E_t\bigr) - \eta_m\bigr\}, \tag{8.85}$$

where $E_t = \sum_\mu V(\boldsymbol{x}_\mu,\sigma_\mu;\boldsymbol{J})/p$ is the training error, and $a$ and $b$ are positive
constants. Equation (8.84) is a gradient descent with the direction modified by
$K^{-1}(\boldsymbol{J})$: the eigendirection of $K(\boldsymbol{J})$ with the smallest eigenvalue has the fastest
rate of change. According to (8.85), $\eta$ increases if the current error $V$ exceeds
the cumulative training error $E_t$. The final term with the negative sign in (8.85)
has been added to suppress uncontrolled increase of the learning rate.
The corresponding differential form of the learning dynamics is 


aJ OV 
d SRE D | (8.86) 
Gp = antbllV] - E) - n} 


We now expand V on the right hand sides of these equations around J = B: 


S ~ K(B)(J — B) 


Os i (8.87) 
V] — E œ E(B) — Er + z0 ~ B)K(B)(J — B). 
Then the dynamical equations reduce to 

$$\frac{dJ}{dt} = -\eta(J - B), \qquad \frac{d\eta}{dt} = a\eta\left\{\frac{b}{2}\,{}^{t}(J - B)K(B)(J - B) - \eta\right\}. \tag{8.88}$$

These equations can be rewritten in terms of the energy as 

$$\frac{dE(J)}{dt} = -2\eta\{E(J) - E(B)\}, \qquad \frac{d\eta}{dt} = ab\,\eta\{E(J) - E(B)\} - a\eta^{2}. \tag{8.89}$$

The solution is easily found to be 

$$\epsilon_g = E(J) = E(B) + \frac{1}{b}\left(\frac{1}{2} - \frac{1}{a}\right)\frac{1}{t}, \tag{8.90}$$

$$\eta = \frac{1}{2t}. \tag{8.91}$$

Choosing a > 2, we have thus shown that the learning rate with the 1/t law 
(8.91) leads to rapid convergence (8.90) just as in the previous subsection but 
under a very general condition. It is possible to generalize the present framework 
to the cases without a cost function or without explicit knowledge of the Hessian 
K (Müller et al. 1998). 
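
The asymptotic forms (8.90) and (8.91) can be checked by integrating (8.89) numerically. The following is only a minimal sketch: the forward-Euler scheme, the parameter values a = 6 and b = 1, and the initial conditions are illustrative assumptions, not taken from the text. The variable `e` stands for E(J) - E(B).

```python
def integrate_rate_equations(a=6.0, b=1.0, e0=1.0, eta0=0.1,
                             t0=1.0, t_end=2000.0, dt=2e-3):
    """Forward-Euler integration of (8.89); e stands for E(J) - E(B)."""
    e, eta, t = e0, eta0, t0
    while t < t_end:
        de = -2.0 * eta * e                        # first equation of (8.89)
        deta = a * b * eta * e - a * eta * eta     # second equation of (8.89)
        e, eta, t = e + dt * de, eta + dt * deta, t + dt
    return t, e, eta


t, e, eta = integrate_rate_equations()
print("eta * t        =", eta * t)   # to be compared with 1/2 from (8.91)
print("(E - E(B)) * t =", e * t)     # to be compared with (1/b)(1/2 - 1/a) from (8.90)
```

With a > 2 both products should settle to constants, which is the 1/t law stated above.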


8.3.6 Learning with query 


Inputs satisfying the condition u = 0 lie on the border of the student output 
sgn(u) (decision boundary). The student is not sure what output to produce for 
such inputs. It thus makes sense to teach the student the correct outputs for 
inputs satisfying u = 0 for efficient learning. This idea of restricting examples to 
a subspace is called learning with query (Kinzel and Ruján 1990). One uses the 
distribution function 


$$P(u, v) = \frac{\delta(u)}{\sqrt{2\pi(1 - R^{2})}}\exp\left\{-\frac{v^{2}}{2(1 - R^{2})}\right\} \tag{8.92}$$


instead of (8.3). This method works only for the Hebb algorithm among the three 
algorithms discussed so far because the perceptron algorithm has $f \propto \Theta(-uv)$, 
which is indefinite at $u = 0$, and the $f$ for the Adatron algorithm is proportional 
to $u$ and vanishes under the present condition $\delta(u)$. 

The dynamical equations (8.55) and (8.56) have the following forms for the 
Hebb algorithm with query: 


$$\frac{dl}{dt} = \frac{1}{2l} \tag{8.93}$$

$$\frac{dR}{dt} = \frac{1}{l}\sqrt{\frac{2(1 - R^{2})}{\pi}} - \frac{R}{2l^{2}}. \tag{8.94}$$


The first equation for $l$ immediately gives $l = \sqrt{t}$, and the second has the 
asymptotic solution for small $\epsilon$ ($R = 1 - \epsilon$) as $\epsilon \approx \pi/16t$. The generalization 
error then behaves asymptotically as 

$$\epsilon_g \approx \frac{1}{2\sqrt{2\pi}}\,\frac{1}{\sqrt{t}}. \tag{8.95}$$

Comparison with the previous result (8.67) reveals that asking queries reduces 
the prefactor by a half in the generalization error. 
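
As a rough numerical check of the asymptotics quoted above, one may integrate (8.94) with $l = \sqrt{t}$ from (8.93). This is a sketch under stated assumptions: the forward-Euler scheme, the step size and the starting point (R = 0 at t = 0.1) are illustrative choices; only the long-time behaviour is meant to be compared with the text.

```python
import math


def hebb_with_query(t_end=1000.0, dt=0.01, t0=0.1):
    """Forward-Euler integration of (8.94), using l = sqrt(t) from (8.93)."""
    t, R = t0, 0.0
    while t < t_end:
        l = math.sqrt(t)
        gap = max(0.0, 1.0 - R * R)   # guards the square root against rounding
        dR = math.sqrt(2.0 * gap / math.pi) / l - R / (2.0 * l * l)
        R += dt * dR
        t += dt
    return t, R


t, R = hebb_with_query()
eps = 1.0 - R
print("eps * t =", eps * t, " vs  pi/16 =", math.pi / 16.0)
print("eps_g   =", math.acos(R) / math.pi,
      " vs  1/(2 sqrt(2 pi t)) =", 1.0 / (2.0 * math.sqrt(2.0 * math.pi * t)))
```

The second line uses the simple-perceptron relation between the generalization error and the overlap, arccos(R)/pi.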
It is possible to improve the performance further by combining query and 
optimization of the learning rate. Straightforward application of the ideas of the 
present section and §8.3.4 leads to the optimized learning rate 


$$\eta = \frac{l}{R}\sqrt{\frac{2(1 - R^{2})}{\pi}} \tag{8.96}$$

for the Hebb algorithm with query. The overlap is solved as $R = \sqrt{1 - c\,e^{-2\alpha/\pi}}$ 
and the generalization error decays asymptotically as 

$$\epsilon_g \approx \frac{\sqrt{c}}{\pi}\,e^{-\alpha/\pi}, \tag{8.97}$$

a very fast exponential convergence. 

Fig. 8.3. Output $T_a(v)$ of the reversed-wedge-type perceptron 

We have so far discussed learning with query based on a heuristic argument 
to restrict the training set to the decision boundary. It is possible to formulate 
this problem in a more systematic way using an appropriate cost function to be 
extremized (such as information gain); one can then construct the best possible 
algorithm to ask queries depending upon various factors. See Sollich (1994) and 
references cited therein for this and related topics. 


8.3.7 On-line learning of unlearnable rule 


It is relatively straightforward, compared to the batch case, to analyse on-line 
learning for unlearnable rules. Among various types of unlearnable rules (some 
of which were mentioned in §8.2.7), we discuss here the case of the 
reversed-wedge-type non-monotonic perceptron as the teacher (Inoue et al. 1997; Inoue 
and Nishimori 1997). The student is the usual simple perceptron. The input 
signal $\xi$ is shared by the student and teacher. After going through the synaptic 
couplings, the input signal becomes $u$ for the student and $v$ for the teacher as 
defined in (8.1). The student output is sgn($u$) and the teacher output is assumed 
to be $T_a(v) = \mathrm{sgn}\{v(a - v)(a + v)\}$, see Fig. 8.3. 

The generalization error is obtained by integration of the distribution function 
of u and v, (8.3), over the region where the student output is different from that 
of the teacher: 

$$\epsilon_g = E(R) = \int du\,dv\,P(u, v)\,\Theta(-T_a(v)\,\mathrm{sgn}(u)) = 2\int_{-\infty}^{0} Dt\,\Omega(R, t), \tag{8.98}$$

where 

$$\Omega(R, t) = \int Dz\,\{\Theta(-z\sqrt{1 - R^{2}} - Rt - a) + \Theta(z\sqrt{1 - R^{2}} + Rt) - \Theta(z\sqrt{1 - R^{2}} + Rt - a)\}. \tag{8.99}$$

Fig. 8.4. Generalization error of the non-monotonic perceptron 



We have used in the derivation of (8.98) that (1) $\int dv\,P(u, v)\,\Theta(-T_a(v)\,\mathrm{sgn}(u))$ 
is an even function of $u$ (and thus the integral over $u < 0$ is sufficient if we 
multiply the result by 2) and that (2) $u$ and $v$ can be written in terms of two 
independent Gaussian variables $t$ and $z$ with vanishing mean and variance unity 
as $u = t$ and $v = z\sqrt{1 - R^{2}} + Rt$. In Fig. 8.4 we have drawn this generalization error as a 
function of $R$. This non-monotonic perceptron reduces to the simple perceptron 
in the limit $a \to \infty$; then $E(R)$ is a monotonically decreasing function of $R$ and 
has the minimum value $E(R) = 0$ at $R = 1$. When $a = 0$, the teacher output is 
just the opposite of that of a simple perceptron with the same couplings, 
$T_0(v) = -\mathrm{sgn}(v)$, and we have $E(R) = 0$ at $R = -1$. 
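
The function E(R) of (8.98) and (8.99) is easy to evaluate numerically. In the sketch below the inner integral over $z$ in (8.99) is carried out analytically in terms of the Gaussian tail function (here called `H`, a name introduced only for this illustration), and the integral over $t$ is approximated by a midpoint rule; the grid size and cut-off are arbitrary choices.

```python
import math


def H(x):
    """Gaussian tail probability, i.e. the integral of Dz from x to infinity."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def omega(R, t, a):
    """Omega(R, t) of (8.99) with the z-integral done analytically (|R| < 1)."""
    s = math.sqrt(1.0 - R * R)
    return H((a + R * t) / s) + H(-R * t / s) - H((a - R * t) / s)


def generalization_error(R, a, n=4000, t_min=-8.0):
    """Midpoint-rule evaluation of (8.98) over t in (t_min, 0)."""
    h = -t_min / n
    total = 0.0
    for i in range(n):
        t = t_min + (i + 0.5) * h
        total += math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi) * omega(R, t, a)
    return 2.0 * h * total


# Scan R for one intermediate width a (value chosen for illustration only).
for R in (-0.9, -0.5, 0.0, 0.5, 0.9):
    print(R, generalization_error(R, a=0.5))
```

For very large $a$ the result should approach the simple-perceptron value arccos(R)/pi, and for $a = 0$ the value 1 - arccos(R)/pi.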

For intermediate values $0 < a < \infty$, the generalization error does not vanish 
at any $R$, and the student cannot simulate the teacher irrespective of the learning 
algorithm. As one can see in Fig. 8.4, when $0 < a < a_{c1} = \sqrt{2\log 2} = 1.18$, 
there is a minimum of $E(R)$ in the range $-1 < R < 0$. This minimum is the 
global minimum when $0 < a < a_{c2} \approx 0.80$. We have shown in Fig. 8.5 the 
minimum value of the generalization error (a) and the $R$ that gives this minimum 
(b) as functions of $a$. In the range $a > a_{c2}$ the minimum of the generalization 
error is achieved when the student has the same couplings as the teacher, $R = 1$, 
but the minimum value does not vanish because the structures are different. In 
the case of $0 < a < a_{c2}$, the minimum of the generalization error lies in the range 
$-1 < R < 0$. 

The generalization error as a function of R, (8.98), does not depend on the 
type of learning. The general forms of the dynamical equations (8.55) and (8.56) 
also remain intact, but the function f is different. For instance, the Hebb al- 
gorithm in the case of a non-monotonic teacher has f = sgn{v(a — v)(a + v)}. 
Then 

$$[fu] = \sqrt{\frac{2}{\pi}}\,R\,(1 - 2e^{-a^{2}/2}), \quad [fv] = \sqrt{\frac{2}{\pi}}\,(1 - 2e^{-a^{2}/2}), \quad [f^{2}] = 1. \tag{8.100}$$

Fig. 8.5. Minimum value of the generalization error (a) and overlap R that 
gives the minimum of the generalization error (b). 


Accordingly, the dynamical equations read 


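$$\frac{dl}{dt} = \frac{1}{2l} + \sqrt{\frac{2}{\pi}}\,R\,(1 - 2e^{-a^{2}/2}) \tag{8.101}$$

$$\frac{dR}{dt} = -\frac{R}{2l^{2}} + \frac{1}{l}\sqrt{\frac{2}{\pi}}\,(1 - 2e^{-a^{2}/2})(1 - R^{2}). \tag{8.102}$$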


Solutions of (8.101) and (8.102) show different behaviour according to the value 
of a. To see this, we set R = 0 on the right hand side of (8.102): 


$$\frac{dR}{dt} \approx \frac{1}{l}\sqrt{\frac{2}{\pi}}\,(1 - 2e^{-a^{2}/2}). \tag{8.103}$$


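This shows that $R$ increases from zero when $a > a_{c1} = \sqrt{2\log 2}$ and decreases if 
$0 < a < a_{c1}$. Since (8.102) has a fixed point at $R = 1$ and $l \to \infty$, $R$ approaches 
one as $t \to \infty$ when $a > a_{c1}$, and the learning curve is determined asymptotically 
from (8.101) and (8.102) using $R = 1 - \epsilon$ and $l = 1/\delta$. It is not difficult to check 
that $\epsilon \approx (2k^{2}t)^{-1}$, $\delta \approx (kt)^{-1}$ holds as $\epsilon, \delta \ll 1$ with $k = \sqrt{2}(1 - 2e^{-a^{2}/2})/\sqrt{\pi}$. 
Substituting this into (8.98), we find the final result 

$$\epsilon_g \approx \frac{\sqrt{2\epsilon}}{\pi} + \mathrm{Erfc}\left(\frac{a}{\sqrt{2}}\right) = \frac{1}{\sqrt{2\pi}\,(1 - 2e^{-a^{2}/2})}\,\frac{1}{\sqrt{t}} + \mathrm{Erfc}\left(\frac{a}{\sqrt{2}}\right). \tag{8.104}$$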


The second term on the right hand side is the asymptotic value as $t \to \infty$ and 
this coincides with the theoretical minimum of generalization error shown in Fig. 
8.5(a). This achievement of the smallest possible error is a remarkable feature 
of the Hebb algorithm, and is not shared by other on-line learning algorithms 
including the perceptron and Adatron algorithms. 

Next, when $0 < a < a_{c1}$, we find $R \to -1$ as $t \to \infty$. If we set $R = -1 + \epsilon$, 
$l = 1/\delta$, the asymptotic form is evaluated as 

$$\epsilon_g \approx \frac{1}{\sqrt{2\pi}\,(1 - 2e^{-a^{2}/2})}\,\frac{1}{\sqrt{t}} + 1 - \mathrm{Erfc}\left(\frac{a}{\sqrt{2}}\right). \tag{8.105}$$

The asymptotic value, given by the terms after the first on the right hand side, is 
larger than the theoretical minimum of the generalization error in the range 
$0 < a < a_{c1}$. Thus the Hebb algorithm is not the optimal one for small values of $a$. As one 
can see in Fig. 8.4, the value $R = -1$ does not give the minimum of $E(R)$ in 
the case of $0 < a < a_{c1}$. This is the reason why the Hebb algorithm, which gives 
$R \to -1$ as $t \to \infty$, does not lead to convergence to the optimal state. 
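
The two regimes can be seen by integrating (8.101) and (8.102) directly. The sketch below uses forward Euler with an arbitrarily chosen small initial length l = 0.1 and overlap R = 0 (both assumptions made only for this illustration), once for a value of $a$ above $a_{c1} \approx 1.18$ and once below.

```python
import math


def hebb_flow(a, t_end=200.0, dt=1e-3, l0=0.1, R0=0.0):
    """Forward-Euler integration of (8.101) and (8.102) for the Hebb algorithm."""
    k = math.sqrt(2.0 / math.pi) * (1.0 - 2.0 * math.exp(-a * a / 2.0))
    l, R, t = l0, R0, 0.0
    while t < t_end:
        dl = 1.0 / (2.0 * l) + k * R
        dR = -R / (2.0 * l * l) + (k / l) * (1.0 - R * R)
        l, R, t = l + dt * dl, R + dt * dR, t + dt
    return l, R


for a in (2.0, 0.5):
    l, R = hebb_flow(a)
    print("a =", a, " l =", l, " R =", R)
```

For a = 2.0 the overlap should approach +1, and for a = 0.5 it should approach -1, in line with the sign of the right hand side of (8.103).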


Bibliographical note 


We have elucidated the basic ideas of learning in very simple cases. There are 
many interesting problems not discussed here, including the effects of noise, 
perceptron with continuous output, multilayer networks, on-line Bayesian learn- 
ing, path-integral formalism, support vector machines, restricted training set, 
information-theoretical approaches, and unsupervised learning. These and other 
topics are discussed in various review articles (Watkin and Rau 1993; Domany 
et al. 1995; Wong et al. 1997; Saad 1998). 


9 
OPTIMIZATION PROBLEMS 


A decision-making problem is often formulated as minimization or maximization 
of a multivariable function, an optimization problem. In the present chapter, after 
a brief introduction, we show that methods of statistical mechanics are useful to 
study some optimization problems. Then we discuss mathematical properties of 
simulated annealing, an approximate numerical method for generic optimization 
problems. In particular we analyse the method of generalized transition 
probability, which has recently been attracting considerable attention because of 
its rapid convergence properties. 


9.1 Combinatorial optimization and statistical mechanics 


The goal of an optimization problem is to find the variables to minimize (or 
maximize) a multivariable function. When the variables take only discrete values 
under some combinatorial constraints, the problem is called a combinatorial 
optimization problem. The function to be minimized (or maximized), $f(x_1, x_2, \ldots, x_n)$, 
is termed the cost function or the objective function. It is sufficient to discuss 
minimization because maximization of $f$ is equivalent to minimization of $-f$. 

An example of a combinatorial optimization problem familiar to physicists 
is to find the ground state of an Ising model. The variables are the set of spins 
$\{S_1, S_2, \ldots, S_N\}$ and the Hamiltonian is the cost function. The ground state is 
easily determined if all the interactions are ferromagnetic, but this is not the 
case in spin glasses. The possible number of spin configurations is $2^{N}$, and we 
would find the correct ground state of a spin glass system if we checked all of 
these states explicitly. However, the number $2^{N}$ grows quite rapidly with the 
increase of $N$, and the ground-state search by such a naive method quickly runs 
into the practical difficulty of explosively large computation time. Researchers 
have been trying to find an algorithm to identify the ground state of a spin glass 
by which one has to check fewer than an exponential number of states (i.e. a power 
of $N$). These efforts have so far been unsuccessful except for a few special cases. 
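
The naive search mentioned above can be written in a few lines. This is a minimal sketch, assuming a Hamiltonian of the form $H = -\sum_{i<j} J_{ij}S_iS_j$ and a randomly generated set of couplings $J_{ij} = \pm 1$ (the random instance and the function names are illustrative, not from the text). The loop visits all $2^{N}$ configurations, so it is only usable for small $N$.

```python
import itertools
import random


def energy(J, s):
    """H = -sum_{i<j} J[i][j] s_i s_j for a spin configuration s."""
    n = len(s)
    return -sum(J[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n))


def ground_state(J):
    """Exhaustive search over all 2^N spin configurations."""
    best_e, best_s = None, None
    for s in itertools.product((-1, 1), repeat=len(J)):
        e = energy(J, s)
        if best_e is None or e < best_e:
            best_e, best_s = e, s
    return best_e, best_s


rng = random.Random(0)
N = 10
J = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(N)]
print(ground_state(J))   # only the entries J[i][j] with i < j are used
```

Each additional spin doubles the running time, which is the practical difficulty referred to in the text.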

The ground-state determination is an example of an NP (non-deterministic 
polynomial) complete problem. Generally, any NP complete problem can be 
transformed into any other NP complete problem by a polynomial-time algorithm, 
so that a polynomial-time algorithm for one of them would solve them all; but no 
polynomial-time algorithm has been found so far for any of the NP complete 
problems. These properties characterize the class 'NP complete'. 
It is indeed expected that we will need exponentially large computational efforts 
to solve any NP complete problem by any algorithm. 

An exponential function of $N$ grows quite rapidly as $N$ increases, and it is 
virtually impossible to solve an NP complete problem for any reasonable value of 
$N$. There are many examples of NP complete problems including the spin glass 
ground state, travelling salesman, number partitioning, graph partitioning, and 
knapsack problems. We shall discuss the latter three problems in detail later. A 
description of the satisfiability problem will also be given. In this section we 
comment briefly on the travelling salesman problem. 

In the travelling salesman problem, one is given N cities and distances be- 
tween all pairs of cities. One is then asked to find the shortest path to return to 
the original city after visiting all the cities. The cost function is the length of the 
path. There are about N! possible paths; starting from a city, one can choose the 
next out of N — 1, and the next out of N — 2, and so on. The precise number of 
paths is (N—1)!/2 because, first, the equivalence of all cities as the starting point 
gives the dividing factor of N, and, second, any path has an equivalent one with 
the reversed direction, accounting for the factor 2 in the denominator. Since the 
factorial increases more rapidly than the exponential function, identification of 
the shortest path is obviously very difficult. The travelling salesman problem is a 
typical NP complete problem and has been studied quite extensively. It also has 
some practical importance such as the efficient routing of merchandise delivery. 
A statistical-mechanical analysis of the travelling salesman problem is found in 
Mézard et al. (1987). 
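
The counting argument above can be made concrete with a brute-force search. The sketch below is illustrative only: the function names and the four-city example are assumptions introduced here. The starting city is fixed and each tour is visited in one direction only, so exactly $(N - 1)!/2$ tours are examined.

```python
import itertools
import math


def num_tours(n):
    """Number of distinct closed tours through n cities, (n - 1)!/2."""
    return math.factorial(n - 1) // 2


def shortest_tour(dist):
    """Exhaustive search; returns (length, tour, number of tours examined)."""
    n = len(dist)
    best_len, best_tour, visited = None, None, 0
    for perm in itertools.permutations(range(1, n)):
        if perm[0] > perm[-1]:        # skip the reversed copy of each tour
            continue
        tour = (0,) + perm            # city 0 is fixed as the starting point
        length = sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))
        visited += 1
        if best_len is None or length < best_len:
            best_len, best_tour = length, tour
    return best_len, best_tour, visited


cities = [(0, 0), (1, 0), (1, 1), (0, 1)]   # corners of a unit square
dist = [[math.dist(p, q) for q in cities] for p in cities]
print(shortest_tour(dist), num_tours(len(cities)))
```

Already for a few tens of cities the factorial growth makes this approach impractical.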

Before elucidating statistical mechanical approaches to combinatorial opti- 
mization problems, we comment on a difference in viewpoints between statisti- 
cal mechanics and optimization problems. In statistical mechanics, the target of 
primary interest is the behaviour of macroscopic quantities, whereas details of 
microscopic variables in the optimized state play more important roles in opti- 
mization problems. For example, in the travelling salesman problem, the shortest 
path itself is usually much more important than the path length. In the situation 
of spin glasses, this corresponds to clarification of the state of each spin in the 
ground state. Such a point of view is somewhat different from the statistical- 
mechanical standpoint in which the properties of macroscopic order parameters 
are of paramount importance. These distinctions are not always very clear cut, 
however, as is exemplified by the important role of the TAP equation that is 
designed to determine local magnetizations. 


9.2 Number partitioning problem 
9.2.1 Definition 
In the number partitioning problem, one is given a set of positive numbers 
$A = \{a_1, a_2, \ldots, a_N\}$ and asked to choose a subset $B \subset A$ such that the partition 
difference 

$$\tilde{E} = \left|\sum_{i \in B} a_i - \sum_{i \in A \setminus B} a_i\right| \tag{9.1}$$

is minimized. The partition difference is the cost function. The problem is 
unconstrained if one can choose any subset $B$. In the constrained number partitioning 
problem, the size of the set $|B|$ is fixed. In particular, when $|B|$ is half the total 
size, $|B| = N/2$ ($N$ even) or $|B| = (N + 1)/2$ ($N$ odd), the problem is called the 
balanced number partitioning problem. 

The number partitioning problem is known to belong to the class NP com- 
plete (Garey and Johnson 1979). Below we develop a detailed analysis of the 
unconstrained number partitioning problem as it is simpler than the constrained 
problem (Sasamoto et al. 2001; Ferreira and Fontanari 1998). We will be in- 
terested mainly in the number of partitions for a given value of the partition 
difference. 
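
For a handful of numbers the unconstrained problem can be solved by listing all subsets. The following sketch is a plain exhaustive search (the function name and the sample sets are illustrative assumptions); since $B$ and its complement give the same partition difference, the last element is kept outside $B$ to halve the work.

```python
def best_partition(a):
    """Exhaustive search for the subset B minimizing the partition difference (9.1)."""
    n = len(a)
    total = sum(a)
    best_diff, best_subset = total, []
    for mask in range(1 << (n - 1)):      # the last element always stays outside B
        subset = [a[i] for i in range(n - 1) if (mask >> i) & 1]
        diff = abs(total - 2 * sum(subset))
        if diff < best_diff:
            best_diff, best_subset = diff, subset
    return best_diff, best_subset


print(best_partition([4, 5, 6, 7, 8]))   # a perfect partition exists here
print(best_partition([1, 2, 4]))
```

The number of subsets doubles with every added element, so this serves only as a reference for small $N$.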


9.2.2 Subset sum 


For simplicity, we assume that $N$ is even and the $a_i$ are positive integers, not 
exceeding an integer $L$, with the gcd unity. The number partitioning problem is 
closely related to the problem of subset sum in which we count the number of 
configurations, $C(E)$, with a given value $E$ of the Hamiltonian 

$$H = \sum_{i=1}^{N} a_i n_i. \tag{9.2}$$

Here $n_i$ is a dynamical variable with the value of 0 or 1. The cost function of 
the number partitioning problem (9.1), to be denoted as $\tilde{H}$, is related to this 
Hamiltonian (9.2) using the relation $S_i = 2n_i - 1$ as 

$$\tilde{H} = \left|\sum_{i=1}^{N} a_i S_i\right| = \left|2H - \sum_{i=1}^{N} a_i\right|. \tag{9.3}$$

The spin configuration $S_i = 1$ indicates that $a_i \in B$ and $S_i = -1$ otherwise. From 
(9.3) we have $2E - \sum_i a_i = \pm\tilde{E}$, and the number of configurations, $C(E)$, for 
the subset sum (9.2) is translated into that for the number partitioning problem 
(9.3), $\tilde{C}(\tilde{E})$, by the relation 

$$\tilde{C}(\tilde{E}) = \begin{cases} C\left(\dfrac{\tilde{E} + \sum_i a_i}{2}\right) + C\left(\dfrac{-\tilde{E} + \sum_i a_i}{2}\right) & (\tilde{E} \neq 0) \\[2ex] C\left(\dfrac{\sum_i a_i}{2}\right) & (\tilde{E} = 0). \end{cases} \tag{9.4}$$


The following analysis will be developed for a given set {a;}; no configurational 
average will be taken. 


9.2.3 Number of configurations for subset sum 


It is straightforward to write the partition function of the subset sum: 

$$Z = \sum_{\{n_i\}} e^{-\beta H} = \prod_{i=1}^{N}\left(1 + e^{-\beta a_i}\right). \tag{9.5}$$

By expanding the right hand side in powers of $e^{-\beta} \equiv w$, we can express the 
above equation as 

$$Z = \sum_{E=0}^{E_{\max}} C(E)\,w^{E}, \tag{9.6}$$

where $E_{\max} = a_1 + \cdots + a_N$. Since $E$ is a non-negative integer, the coefficient $C(E)$ 
of the polynomial (9.6) is written in terms of a contour integral: 

$$C(E) = \frac{1}{2\pi i}\oint \frac{dw}{w^{E+1}}\,Z = \frac{1}{2\pi i}\oint dw\,\frac{e^{\log Z}}{w^{E+1}}, \tag{9.7}$$



where the integral is over a closed contour around the origin of the complex-$w$ 
plane. Since $\log Z$ is proportional to the system size $N$ ($\gg 1$), we can evaluate 
(9.7) by steepest descent. 

It is useful to note that the value of the energy $E$ has a one-to-one 
correspondence with the inverse temperature $\beta$ through the expression of the 
expectation value 

$$E = \sum_{i=1}^{N} \frac{a_i}{1 + e^{\beta a_i}} \tag{9.8}$$

when the fluctuations around this thermal expectation value are negligible, which 
is the case in the large-$N$ limit. Let us therefore write the inverse temperature 
which yields the given value of the energy, $E_0$, as $\beta_0$. It is then convenient to rewrite 
(9.7) by specifying the integration contour as the circle of radius $e^{-\beta_0}$ and using 
the phase variable $\theta$ defined by $w = e^{-\beta_0 + i\theta}$: 

$$C(E_0) = \frac{1}{2\pi}\int_{-\pi}^{\pi} d\theta\; e^{\log Z + \beta_0 E_0 - iE_0\theta}. \tag{9.9}$$



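It is easy to verify that the saddle point of the integrand is at $\theta = 0$ (see 
footnote 22). We therefore expand $\log Z$ in powers of $\theta$ ($= i\beta - i\beta_0$) to second 
order and find 

$$C(E_0) = e^{\log Z|_{\beta=\beta_0} + \beta_0 E_0}\,\frac{1}{2\pi}\int_{-\pi}^{\pi} d\theta\,\exp\left[-\frac{\theta^{2}}{2}\left.\frac{\partial^{2}}{\partial\beta^{2}}\log Z\right|_{\beta=\beta_0}\right]. \tag{9.10}$$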


Since the exponent in the integrand is proportional to $N$, we may extend the 
integration range to $\pm\infty$. The result is 

$$C(E_0) = \frac{e^{\beta_0 E_0}\prod_i\left(1 + e^{-\beta_0 a_i}\right)}{\sqrt{2\pi\sum_i a_i^{2}/\{(1 + e^{\beta_0 a_i})(1 + e^{-\beta_0 a_i})\}}}. \tag{9.11}$$

22 There exist other saddle points if the gcd of $\{a_i\}$ is not unity. 

Fig. 9.1. The number of configurations as a function of the energy for the subset 
sum with N = 20, L = 256, and $\{a_i\}$ = {218, 13, 227, 193, 70, 134, 89, 
198, 205, 147, 227, 190, 64, 168, 4, 209, 27, 239, 192, 131}. The theoretical 
prediction (9.11) is indistinguishable from the numerical results plotted in 
dots. 


This formula gives the number of configurations of the subset sum as a function 
of $E_0$ through 

$$E_0 = \sum_{i=1}^{N} \frac{a_i}{1 + e^{\beta_0 a_i}}. \tag{9.12}$$

Figure 9.1 depicts the result of numerical verification of (9.11). 
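
A comparison of this kind can be reproduced with a short script. The sketch below is not the procedure used for Fig. 9.1 (which is not described in the text); it counts $C(E)$ exactly by dynamic programming for the set quoted in the caption, solves (9.12) for $\beta_0$ by bisection, and evaluates (9.11). The bisection bracket and the averaging window around the peak are arbitrary choices made for this illustration.

```python
import math

A = [218, 13, 227, 193, 70, 134, 89, 198, 205, 147,
     227, 190, 64, 168, 4, 209, 27, 239, 192, 131]


def exact_counts(a):
    """C(E) for every E, by dynamic programming over the items."""
    counts = [0] * (sum(a) + 1)
    counts[0] = 1
    for ai in a:
        for e in range(len(counts) - 1, ai - 1, -1):
            counts[e] += counts[e - ai]
    return counts


def mean_energy(beta, a):
    """Right hand side of (9.12)."""
    return sum(ai / (1.0 + math.exp(beta * ai)) for ai in a)


def solve_beta(e0, a, lo=-1.0, hi=1.0):
    """Bisection for beta_0; mean_energy decreases monotonically with beta."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_energy(mid, a) > e0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def saddle_point_count(e0, a):
    """Steepest-descent estimate (9.11) of C(E_0)."""
    beta = solve_beta(e0, a)
    log_z = sum(math.log1p(math.exp(-beta * ai)) for ai in a)
    curvature = sum(ai * ai / ((1.0 + math.exp(beta * ai)) * (1.0 + math.exp(-beta * ai)))
                    for ai in a)
    return math.exp(beta * e0 + log_z) / math.sqrt(2.0 * math.pi * curvature)


counts = exact_counts(A)
window = range(1452, 1493)                       # energies around the peak
exact_avg = sum(counts[e] for e in window) / len(window)
theory_avg = sum(saddle_point_count(e, A) for e in window) / len(window)
print(exact_avg, theory_avg)
```

Individual values of $C(E)$ fluctuate from one $E$ to the next for such a small system, which is why averages over a window are compared here.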


9.2.4 Number partitioning problem 


The number of partitions for a given value of $\tilde{E}$ of the number partitioning 
problem, $\tilde{C}(\tilde{E})$, can be derived from (9.11) using (9.4). For example, the optimal 
configuration with $\tilde{E} = 0$ has $\beta_0 = 0$ according to the relation $2E - \sum_i a_i = \pm\tilde{E}$ 
and (9.12). Then, (9.11) with $\beta_0 = 0$ gives the result for the number of optimal 
partitions as 

$$\tilde{C}(\tilde{E} = 0) = \frac{2^{N+1}}{\sqrt{2\pi\sum_i a_i^{2}}}. \tag{9.13}$$



An important restriction on the applicability of the present argument is that 
$L$ should not exceed $2^{N}$ for the following reason. If we choose $\{a_i\}$ from $\{1, \ldots, L\}$ 
using a well-behaved distribution function, then $\sum_i a_i^{2} = O(L^{2}N)$, and (9.13) gives 
$\tilde{C}(\tilde{E} = 0) = O(2^{N}/L)$ apart from a power of $N$. Since $\tilde{C}$ is the number of partitions, 
$2^{N}/L$ must not be smaller than one, which leads to the condition $2^{N} > L$. This 
condition may correspond to the strong crossover in the behaviour of the probability 
to find a perfect partition ($\tilde{E} = 0$) found numerically (Gent and Walsh 1996). The other 
case of $2^{N} < L$ is hard to analyse by the present method although it is quite 
interesting from the viewpoint of information science (Gent and Walsh 1996). It 
is possible to apply the present technique to the constrained number partitioning 
problem (Sasamoto et al. 2001). 

Mertens (1998) applied statistical mechanics directly to the system with the 
Hamiltonian (9.3) and derived the energy $\tilde{E}(T)$ and entropy $S(T)$ as functions 
of the temperature $T$. The resulting $S$ as a function of $\tilde{E}$ differs from ours 
under the identification $S(\tilde{E}) = \log\tilde{C}(\tilde{E})$ except at $\tilde{E} = 0$. In particular, his 
result gives $\partial S/\partial\tilde{E} > 0$ whereas we have the opposite inequality $\partial S/\partial\tilde{E} < 0$, 
as can be verified from (9.11) and (9.12). Figure 9.1 shows that the latter 
possibility is realized if we note that $\tilde{E} = 0$ corresponds to the peak of the curve 
at $E = \sum_i a_i/2$. It is possible to confirm our result also in the solvable case with 
$a_1 = \cdots = a_N = 1$: we then have $\tilde{H} = |\sum_i S_i|$ for which 

$$\tilde{C}(\tilde{E}) = \binom{N}{(N + \tilde{E})/2}. \tag{9.14}$$

This is a monotonically decreasing function of $\tilde{E}$. Equations (9.11) and (9.12) 
with $a_1 = \cdots = a_N = 1$ reproduce (9.14) for sufficiently large $N$. 
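
The last statement can be checked explicitly. With all $a_i = 1$, equation (9.12) can be solved in closed form, $e^{\beta_0} = N/E_0 - 1$, and (9.11) can be compared with the exact binomial coefficient for the subset sum, where $E_0 = (N + \tilde{E})/2$. The values N = 200 and the sample energies below are arbitrary choices for this sketch.

```python
import math


def saddle_point_count_uniform(N, E0):
    """Evaluate (9.11) for a_1 = ... = a_N = 1, with beta_0 taken from (9.12)."""
    beta0 = math.log(N / E0 - 1.0)                 # solves E0 = N / (1 + exp(beta0))
    log_numerator = beta0 * E0 + N * math.log1p(math.exp(-beta0))
    curvature = N / ((1.0 + math.exp(beta0)) * (1.0 + math.exp(-beta0)))
    return math.exp(log_numerator) / math.sqrt(2.0 * math.pi * curvature)


N = 200
for E0 in (60, 100, 140):
    exact = math.comb(N, E0)                       # exact number of subsets with sum E0
    print(E0, saddle_point_count_uniform(N, E0) / exact)
```

The printed ratios should be close to one; the steepest-descent result is in this case simply Stirling's approximation to the binomial coefficient.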

The system described by the Hamiltonian (9.3) is anomalous because the 
number of configurations (partitions) decreases with increasing energy. We should 
be careful in applying statistical mechanics to such a system; the calculated en- 
tropy may not necessarily give the logarithm of the number of configurations as 
exemplified above. 


9.3 Graph partitioning problem 


The next example of statistical-mechanical analysis of an optimization problem 
is the graph partitioning problem. It will be shown that the graph partitioning 
problem is equivalent to the SK model in a certain limit. 


9.3.1 Definition 

Suppose we are given $N$ nodes $V = \{v_1, v_2, \ldots, v_N\}$ and the set of edges between 
them $E = \{(v_i, v_j)\}$. Here $N$ is an even integer. A graph is a set of such nodes 
and edges. In the graph partitioning problem, we should divide the set $V$ into 
two subsets $V_1$ and $V_2$ with the same size $N/2$ and, at the same time, minimize 
the number of edges connecting nodes in $V_1$ with those in $V_2$. The cost function is 
the number of edges between $V_1$ and $V_2$. Let us consider an example of the graph 
specified by $N = 6$, $E = \{(1, 2), (1, 3), (2, 3), (2, 4), (4, 5)\}$. The cost function has 
the value $f = 1$ for the partitioning $V_1 = \{1, 2, 3\}$, $V_2 = \{4, 5, 6\}$ and $f = 3$ for 
$V_1 = \{1, 2, 4\}$, $V_2 = \{3, 5, 6\}$ (Fig. 9.2). 
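
The example can be reproduced directly. The sketch below evaluates the cost function for the two partitionings quoted above and, as an illustration only, finds the best balanced partitioning of this small graph by listing all choices of $V_1$; the function names are hypothetical.

```python
from itertools import combinations

EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (4, 5)]
NODES = [1, 2, 3, 4, 5, 6]


def cut_size(v1, edges):
    """Number of edges with exactly one end in the subset v1."""
    v1 = set(v1)
    return sum((i in v1) != (j in v1) for i, j in edges)


def best_bisection(nodes, edges):
    """Exhaustive search over all subsets V1 of size N/2."""
    half = len(nodes) // 2
    return min((cut_size(s, edges), s) for s in combinations(nodes, half))


print(cut_size({1, 2, 3}, EDGES))   # partitioning V1 = {1, 2, 3}
print(cut_size({1, 2, 4}, EDGES))   # partitioning V1 = {1, 2, 4}
print(best_bisection(NODES, EDGES))
```

The number of balanced partitionings grows roughly as $2^{N}/\sqrt{N}$, so the exhaustive search is again limited to small graphs.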

It is known that the graph partitioning problem is NP complete. The prob- 
lem has direct instances in real-life applications such as the configuration of 
components on a computer chip to minimize wiring lengths. 

The problem of a random graph, in which each pair of nodes $(v_i, v_j)$ has 
an edge between them with probability $p$, is conveniently treated by statistical 
mechanics (Fu and Anderson 1986). We assume in this book that $p$ is of order 
1 and independent of $N$. Thus the number of edges emanating from each node 
is $pN$ on average, which is a very large number for large $N$. The methods of 
mean-field theory are effectively applied in such a case. 

Fig. 9.2. A graph partitioning with N = 6 


9.3.2 Cost function 

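We start our argument by expressing the cost function $f(p)$ in terms of the Ising 
spin Hamiltonian. Let us write $S_i = 1$ when the node $v_i$ belongs to the set $V_1$ and 
$S_i = -1$ otherwise. The existence of an edge between $v_i$ and $v_j$ is represented by 
the coupling $J_{ij} = J$, and $J_{ij} = 0$ otherwise. The Hamiltonian is written as 

$$\begin{aligned} H = -\sum_{i<j} J_{ij}S_iS_j &= -\frac{1}{2}\Bigl(\sum_{i\in V_1, j\in V_1} + \sum_{i\in V_2, j\in V_2} - \sum_{i\in V_1, j\in V_2} - \sum_{i\in V_2, j\in V_1}\Bigr)J_{ij} \\ &= -\frac{1}{2}\sum_{i\neq j} J_{ij} + \Bigl(\sum_{i\in V_1, j\in V_2} + \sum_{i\in V_2, j\in V_1}\Bigr)J_{ij} \\ &= -\frac{J}{2}\cdot\frac{2N(N-1)p}{2} + 2f(p)J, \end{aligned} \tag{9.15}$$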
EVL JEV? EV JEVI 


where N(N — 1)p/2 is the total number of edges. The cost function and the 
Hamiltonian are therefore related to each other as 


$$f(p) = \frac{H}{2J} + \frac{1}{4}N(N-1)p. \tag{9.16}$$

This equation shows that the cost function is directly related to the ground state 
of the Hamiltonian (9.15) under the condition that the set $V$ is divided into two 
subsets of exactly equal size 

$$\sum_{i=1}^{N} S_i = 0. \tag{9.17}$$


Equation (9.17) does not change if squared. Expansion of the squared expression 
may be interpreted as the Hamiltonian of a system with uniform antiferromagnetic 
interactions between all pairs. Thus the partitioning problem of a random 
graph has been reduced to the ferromagnetic Ising model with diluted interactions 
(i.e. some $J_{ij}$ are vanishing) and with the additional constraint of uniform 
infinite-range antiferromagnetic interactions. 

As was mentioned above, we apply statistical-mechanical methods that are 
suited to investigate the average (typical) behaviour of a macroscopic system 
of very large size. In the case of the graph partitioning problem, the principal 
objective is then to evaluate the cost function in the limit of large $N$. The 
Hamiltonian (the cost function) is self-averaging, and hence we calculate its 
average with respect to the distribution of random interactions, which should 
coincide with the typical value (the value realized with probability 1) in the 
thermodynamic limit $N \to \infty$. 
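
The relation (9.16) between the cost function and the Ising Hamiltonian can be verified on a single random graph. Note one assumption in this sketch: for a given sample the actual number of edges takes the place of its average $N(N-1)p/2$, so the relation reads $f = H/2J + (\text{number of edges})/2$. The graph size, the seed and the function name are illustrative.

```python
import itertools
import random


def check_cost_relation(n=8, p=0.5, J=1.0, seed=0):
    """Largest deviation from f = H/(2J) + (number of edges)/2 over all balanced partitions."""
    rng = random.Random(seed)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]
    worst = 0.0
    for v1 in itertools.combinations(range(n), n // 2):
        s = [1 if i in v1 else -1 for i in range(n)]      # satisfies sum_i S_i = 0
        H = -J * sum(s[i] * s[j] for i, j in edges)        # Hamiltonian of (9.15)
        f = sum(s[i] != s[j] for i, j in edges)            # edges between V1 and V2
        worst = max(worst, abs(f - (H / (2.0 * J) + len(edges) / 2.0)))
    return worst


print(check_cost_relation())
```

Minimizing $f$ over balanced partitions is therefore the same as minimizing $H$ under the constraint (9.17).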


9.3.3 Replica expression 
Let us calculate the configurational average by the replica method. The replica 
average of the partition function of the system (9.15) is 


$$[Z^{n}] = (1 - p)^{N(N-1)/2}\,\mathrm{Tr}\prod_{i<j}\Bigl\{1 + p_0\exp\Bigl(\beta J\sum_{\alpha} S_i^{\alpha}S_j^{\alpha}\Bigr)\Bigr\}, \tag{9.18}$$


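where $p_0 = p/(1 - p)$, and Tr denotes the sum over spin variables under the 
condition (9.17). We shall show below that (9.18) can be transformed into the 
expression 

$$\begin{aligned} [Z^{n}] = {} & (1 - p)^{N(N-1)/2}\exp\Bigl\{\frac{N(N-1)}{2}\log(1 + p_0) - \frac{N}{2}\bigl(\beta J c_1 n + \beta^{2}J^{2}c_2 n^{2}\bigr)\Bigr\} \\ & \cdot\,\mathrm{Tr}\exp\Bigl\{\frac{(\beta J)^{2}}{2}c_2\sum_{\alpha,\beta}\Bigl(\sum_i S_i^{\alpha}S_i^{\beta}\Bigr)^{2} + O\Bigl(\beta^{3}J^{3}\sum_{i<j}\bigl(\sum_{\alpha} S_i^{\alpha}S_j^{\alpha}\bigr)^{3}\Bigr)\Bigr\}, \end{aligned} \tag{9.19}$$

where 

$$c_j = \frac{1}{j!}\sum_{l=1}^{\infty}\frac{(-1)^{l-1}}{l}\,p_0^{\,l}\,l^{\,j}. \tag{9.20}$$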


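To derive (9.19) we expand the logarithmic and exponential functions as 
follows: 

$$\sum_{i<j}\log\Bigl\{1 + p_0\exp\Bigl(\beta J\sum_{\alpha}S_i^{\alpha}S_j^{\alpha}\Bigr)\Bigr\} = \sum_{l=1}^{\infty}\frac{(-1)^{l-1}}{l}\,p_0^{\,l}\sum_{k_1=0}^{\infty}\cdots\sum_{k_l=0}^{\infty}\frac{(\beta J)^{k_1+\cdots+k_l}}{k_1!\cdots k_l!}\sum_{i<j}\Bigl(\sum_{\alpha}S_i^{\alpha}S_j^{\alpha}\Bigr)^{k_1+\cdots+k_l}.$$

It is convenient to rewrite this formula as a power series in $\beta J$. The constant 
term corresponds to $k_1 = \cdots = k_l = 0$, for which the sum over $l$ gives 
$\{N(N-1)/2\}\log(1 + p_0)$. The coefficient of the term linear in $\beta J$ is, according to (9.17), 
a constant 

$$c_1\sum_{i<j}\sum_{\alpha}S_i^{\alpha}S_j^{\alpha} = \frac{c_1}{2}\sum_{\alpha}\Bigl\{\Bigl(\sum_i S_i^{\alpha}\Bigr)^{2} - N\Bigr\} = -\frac{Nn}{2}c_1. \tag{9.21}$$

Next, the coefficient of the quadratic term is, after consideration of all 
possibilities corresponding to $k_1 + \cdots + k_l = 2$ (such as ($k_1 = k_2 = 1$), ($k_1 = 2$, $k_2 = 0$), 
and so on), 

$$\sum_{l=1}^{\infty}\frac{(-1)^{l-1}}{l}\,p_0^{\,l}\Bigl\{\binom{l}{2} + \frac{l}{2}\Bigr\}\sum_{i<j}\Bigl(\sum_{\alpha}S_i^{\alpha}S_j^{\alpha}\Bigr)^{2} = \frac{c_2}{2}\sum_{\alpha,\beta}\Bigl(\sum_i S_i^{\alpha}S_i^{\beta}\Bigr)^{2} - \frac{Nn^{2}}{2}c_2. \tag{9.22}$$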


Equation (9.19) results from (9.21) and (9.22). 
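
The coefficients (9.20) that enter these expressions have simple closed forms, $c_0 = \log(1 + p_0)$, $c_1 = p$ and $c_2 = p(1 - p)/2$. The following sketch sums the series numerically; it assumes $p < 1/2$ so that $p_0 < 1$ and the series converges, and the truncation at 200 terms is an arbitrary choice.

```python
import math


def c(j, p, terms=200):
    """Partial sum of the series (9.20); converges for p0 = p/(1 - p) < 1."""
    p0 = p / (1.0 - p)
    series = sum((-1) ** (l - 1) / l * p0 ** l * l ** j for l in range(1, terms + 1))
    return series / math.factorial(j)


p = 0.2
print(c(0, p), math.log(1.0 + p / (1.0 - p)))
print(c(1, p), p)
print(c(2, p), p * (1.0 - p) / 2.0)
```

Each printed pair should agree to within rounding error.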


9.3.4 Minimum of the cost function 


Recalling (9.16), we write the lowest value of the cost function as 


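$$f(p) = \frac{N^{2}}{4}p + \frac{1}{2J}E_g$$

$$E_g = -\lim_{\beta\to\infty}\lim_{n\to 0}\frac{1}{\beta n}\Bigl[\mathrm{Tr}\exp\Bigl\{\frac{(\beta J)^{2}}{2}c_2\sum_{\alpha,\beta}\Bigl(\sum_i S_i^{\alpha}S_i^{\beta}\Bigr)^{2} + O\Bigl(\beta^{3}J^{3}\sum_{i<j}\bigl(\sum_{\alpha}S_i^{\alpha}S_j^{\alpha}\bigr)^{3}\Bigr)\Bigr\} - 1\Bigr], \tag{9.23}$$

where we have used the fact that the contribution of the term linear in $\beta J$ to 
$f(p)$ is $Np/4$ from $c_1 = p_0/(1 + p_0) = p$. Equation (9.23) is similar to the SK 
model expression (2.12) with $J_0 = 0$ and $h = 0$. The additional constraint (9.17) 
is satisfied automatically because there is no spontaneous magnetization in the 
SK model when the centre of distribution $J_0$ and the external field $h$ are both 
vanishing. 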

Since $J$ has been introduced as an arbitrarily controllable parameter, we 
may choose $J = \tilde{J}/\sqrt{N}$ and reduce (9.23) to the same form as (2.12). Then, if 
$N \gg 1$, the term proportional to $\beta^{3}J^{3}$ is negligibly smaller than the term of 
$O(\beta^{2}J^{2})$, and we can apply the results for the SK model directly. Substituting 
$c_2 = p(1 - p)/2$, we finally have 

$$f(p) = \frac{N^{2}}{4}p + \frac{1}{2}U_0 N^{3/2}\sqrt{p(1 - p)}. \tag{9.24}$$

Here $U_0$ is the ground-state energy per spin of the SK model and is approximately 
$U_0 = -0.38$ according to numerical studies. The first term on the right 
hand side of (9.24) is interpreted as the number of possible edges between $V_1$ and $V_2$, 
$N^{2}/4$, multiplied by the ratio of actually existing edges $p$. The second term is 
the correction to this leading contribution, which is the non-trivial contribution 
derived from the present argument. 


9.4 Knapsack problem 


The third topic is a maximization problem of a cost function under constraints 
expressed by inequalities. The replica method is again useful to clarify certain 
aspects of this problem (Korutcheva et al. 1994; Fontanari 1995; Inoue 1997). 


9.4.1 Knapsack problem and linear programming 


Suppose that there are $N$ items, each of which has the weight $a_j$ and the value 
$c_j$. The knapsack problem is to maximize the total value by choosing appropriate 
items when the total weight is constrained not to exceed $b$. This may be seen as 
the maximization of the total value of items to be carried in a knapsack within 
a weight limit when one climbs a mountain. 

With the notation $S_j = 1$ when the $j$th item is chosen to be carried and 
$S_j = -1$ otherwise, the cost function (the value to be maximized) $U$ and the 
constraint are written as follows: 

$$U = \sum_{j=1}^{N} c_j\,\frac{S_j + 1}{2}, \qquad Y = \sum_{j=1}^{N} a_j\,\frac{S_j + 1}{2} \le b. \tag{9.25}$$

A generalization of (9.25) is to require many ($K$) constraints: 

$$Y_k = \frac{1}{2}\sum_{j} a_{kj}(S_j + 1) \le b_k \quad (k = 1, \ldots, K). \tag{9.26}$$


When $S_j$ is continuous, the minimization (or maximization) problem of a linear 
cost function under linear constraints is called linear programming. 

In this section we exemplify a statistical-mechanical approach to such a class 
of problems by simplifying the situation so that the $c_j$ are constant $c$ and so 
are the $b_k$ ($= b$). It is also assumed that $a_{kj}$ is a Gaussian random variable with 
mean $\frac{1}{2}$ and variance $\sigma^{2}$, 

$$a_{kj} = \frac{1}{2} + \xi_{kj}, \qquad P(\xi_{kj}) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\xi_{kj}^{2}}{2\sigma^{2}}\right). \tag{9.27}$$

The constraint (9.26) is then written as 

$$Y_k - b = \frac{1}{2}\sum_{j}(1 + S_j)\xi_{kj} + \frac{1}{4}\sum_{j} S_j + \frac{N}{4} - b \le 0. \tag{9.28}$$

The first and second terms in (9.28), where sums over $j$ appear, are of $O(N)$ at 
most, so that (9.28) is satisfied by any $S = \{S_j\}$ if $b \gg N/4$. Then one can carry 
virtually all items in the knapsack ($S_j = 1$). In the other extreme case $N/4 \gg b$, 
one should leave almost all items behind ($S_j = -1$). The system shows the most 
interesting behaviour in the intermediate case $b = N/4$, to which we restrict 
ourselves. 
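
For a small number of items the problem (9.25) and (9.26) can be solved by enumeration, which makes the role of the $\pm 1$ variables explicit. The sketch below is illustrative only: the function names, the tiny deterministic example, and the random instance (N = 10, K = 3, an assumed value of 0.3 for $\sigma$, and $b = N/4$) are not taken from the text.

```python
import itertools
import random


def knapsack_value(s, c):
    """U of (9.25): total value of the items with S_j = 1."""
    return sum(cj * (sj + 1) / 2 for cj, sj in zip(c, s))


def is_feasible(s, a, b):
    """All K constraints (9.26): sum_j a_kj (S_j + 1)/2 <= b_k."""
    return all(sum(akj * (sj + 1) / 2 for akj, sj in zip(a_k, s)) <= b_k
               for a_k, b_k in zip(a, b))


def solve_knapsack(a, c, b):
    """Exhaustive search over all 2^N choices of S_j = +1 or -1."""
    best_u, best_s = None, None
    for s in itertools.product((-1, 1), repeat=len(c)):
        if is_feasible(s, a, b):
            u = knapsack_value(s, c)
            if best_u is None or u > best_u:
                best_u, best_s = u, s
    return best_u, best_s


# One constraint, four items: weights 2, 3, 4, 5, values 3, 4, 5, 6, limit 5.
print(solve_knapsack([[2, 3, 4, 5]], [3, 4, 5, 6], [5]))

# Random instance in the spirit of (9.27), with c_j = 1 and b_k = N/4.
rng = random.Random(1)
N, K, sigma = 10, 3, 0.3
a = [[0.5 + rng.gauss(0.0, sigma) for _ in range(N)] for _ in range(K)]
u_opt, s_opt = solve_knapsack(a, [1.0] * N, [N / 4.0] * K)
print("items carried:", u_opt, "out of", N)
```

With $c_j = 1$ the optimal value is simply the largest number of items that can be carried without violating any of the $K$ constraints.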


9.4.2 Relaxation method 


It is sometimes convenient to relax the condition of discreteness of variables to 
solve a combinatorial optimization problem, the relaxation method, because 
computations are sometimes easier and faster with continuous numbers. If discrete 
values are necessary as the final answer, it often suffices to accept the discrete 
values nearest to the continuous solution. 

It is possible to discuss the large-scale knapsack problem, in which the number 
of constraints $K$ is of the same order as $N$, with $S_j$ kept discrete (Korutcheva 
et al. 1994). However, in the present section we follow the idea of the relaxation 
method and use continuous real variables satisfying $\sum_j S_j^{2} = N$ because the problem 
is then formulated in a very similar form as that of the perceptron capacity 
discussed in §7.6 (Inoue 1997). We are to maximize the cost function 

$$U = \frac{cN}{2} + \frac{c\sqrt{N}}{2}M \tag{9.29}$$

under the constraint 

$$Y_k = \frac{1}{2}\sum_{j}(1 + S_j)\xi_{kj} + \frac{\sqrt{N}}{4}M \le 0, \qquad M = \frac{1}{\sqrt{N}}\sum_{j} S_j. \tag{9.30}$$

The normalization in the second half of (9.30) implies that $\sum_j S_j$ is of order 
$\sqrt{N}$ when $b = N/4$, and consequently about half of the items are carried in 
the knapsack. Consistency of this assumption is confirmed if $M$ is found to be 
of order unity after calculations using this assumption. This will be shown to 
be indeed the case. $M$ is the coefficient of the deviation of order $\sqrt{N}$ from the 
average number of items $N/2$ to be carried in the knapsack. 


9.4.3 Replica calculations 


Let V be the volume of subspace satisfying the condition (9.30) in the space 
of the variables S. The typical behaviour of the system is determined by the 
configurational average of the logarithm of V over the random variables {&,,;}. 
According to the replica method, the configurational average of V” is, similarly 
to (7.77), 


ve] = |v” / [Les2[] 6 | So se - va | 5 | Sse? -N 
at a j 


J 
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1 M 
7 a ee ee, ON se ae 9.3 
Ie |- > tisa ||. (9.31) 


a,k 


where $V_0$ is defined as the quantity with only the part of $\delta(\sum_j (S_j^\alpha)^2 - N)$ kept in
the above integrand.

Equation (9.31) has almost the same form as (7.77), so that we can evaluate 
the former very similarly to §7.6. We therefore write only the result here. Under 
the assumption of replica symmetry, we should extremize the following G: 


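\[ [V^n] = \exp\{ nN G(q, E, F, \tilde{M}) \}, \qquad G = \alpha G_1(q) + G_2(E, F, \tilde{M}) - \frac{i}{2} qF + iE \]

\[ G_1(q) = \log \int_{M/2}^{\infty} \prod_\alpha \frac{d\lambda^\alpha}{2\pi} \int_{-\infty}^{\infty} \prod_\alpha dx^\alpha \exp\Bigl( i \sum_\alpha x^\alpha \lambda^\alpha - \sigma^2 \sum_\alpha (x^\alpha)^2 - \sigma^2 (1+q) \sum_{(\alpha\beta)} x^\alpha x^\beta \Bigr) \]

\[ G_2(E, F, \tilde{M}) = \log \int_{-\infty}^{\infty} \prod_\alpha dS^\alpha \exp\Bigl( -i\tilde{M} \sum_\alpha S^\alpha - iF \sum_{(\alpha\beta)} S^\alpha S^\beta - iE \sum_\alpha (S^\alpha)^2 \Bigr) \]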


where q is the RS value of $q_{\alpha\beta} = N^{-1} \sum_i S_i^\alpha S_i^\beta$, and α = K/N (not to be
confused with the replica index α). Extremization by $\tilde{M}$, the variable conjugate to
the constraint on $\sum_j S_j^\alpha$, readily shows that $\tilde{M} = 0$. We also perform the
integration and eliminate E and F by extremization with respect to these
variables, as in §7.6, to find


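\[ G = \alpha \int Dy \, \log L(q) + \frac{1}{2} \log(1 - q) + \frac{1}{2(1-q)} \tag{9.32} \]

\[ L(q) = 2\sqrt{\pi} \, \mathrm{Erfc}\Bigl( \frac{M/2 + y\sigma\sqrt{1+q}}{\sqrt{2(1-q)}\,\sigma} \Bigr). \tag{9.33} \]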


As the number of items increases with the ratio α (= K/N) fixed, the system
reaches a limit beyond which one cannot carry items. We write $M_{\rm opt}$ for the
value of M at this limit. To evaluate the limit explicitly we note that there is
only one way to choose items to carry at the limit, which implies q = 1. We thus
extremize (9.32) with respect to q and take the limit q → 1 to obtain $M_{\rm opt}$ as a
function of α as


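\[ \alpha(M_{\rm opt}) = \Bigl\{ \int_{-M_{\rm opt}/(2\sqrt{2}\sigma)}^{\infty} Dy \, \Bigl( \frac{M_{\rm opt}}{2\sigma} + \sqrt{2}\, y \Bigr)^2 \Bigr\}^{-1}. \tag{9.34} \]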


Fig. 9.3. $M_{\rm opt}$ as a function of α (full line) and the AT line (dashed)

Figure 9.3 shows $M_{\rm opt}$ as a function of α when σ = 1/12. Stability analysis of
the RS solution leads to the AT line:


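\[ \alpha \int Dt \, \bigl( 1 - \langle z^2 \rangle + \langle z \rangle^2 \bigr)^2 = 1, \qquad \langle \cdots \rangle = \frac{\int_{I_z} Dz \, \cdots}{\int_{I_z} Dz}, \qquad \int_{I_z} = \int_{(M/2\sigma + t\sqrt{1+q})/\sqrt{1-q}}^{\infty}, \tag{9.35} \]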


which is shown dashed in Fig. 9.3. The RS solution is stable in the range α <
0.846 but not beyond. The 1RSB solution for $M_{\rm opt}$ is also drawn in Fig. 9.3, but
is hard to distinguish from the RS solution at this scale of the figure. We may
thus expect that further RSB solutions would give qualitatively similar results.


9.5 Satisfiability problem 


Another interesting instance of the optimization problem is the satisfiability 
problem. One forms a logical expression out of many logical variables in a certain 
special way. The problem is to determine whether or not an assignment of each 
logical variable to ‘true’ or ‘false’ exists so that the whole expression is ‘true’. It 
is possible to analyse some aspects of this typical NP complete problem (Garey 
and Johnson 1979) by statistical-mechanical methods. Since the manipulations 
are rather complicated, we describe only important ideas and some of the steps 
of calculations below. The reader is referred to the original papers for more 
details (Kirkpatrick and Selman 1994; Monasson and Zecchina 1996, 1997, 1998; 
Monasson et al. 1999, 2000). 


9.5.1 Random satisfiability problem 


Let us define a class of satisfiability problems, the random K-satisfiability problem
(K-SAT). Suppose that there are N logical variables $x_1, \ldots, x_N$, each of which is
either ‘true’ or ‘false’. Then we choose K (< N) of the variables $x_{i_1}, \ldots, x_{i_K}$ and
negate each with probability 1/2: $x_{i_k} \to \bar{x}_{i_k}$ (= NOT($x_{i_k}$)). A clause $C_l$ is formed
by logical OR (∨) of these K variables; for example,


\[ C_l = x_{i_1} \vee \bar{x}_{i_2} \vee x_{i_3} \vee \bar{x}_{i_4} \vee \cdots \vee x_{i_K}. \tag{9.36} \]

This process is repeated M times, and we ask if the logical AND (∧) of these M
clauses


\[ F = C_1 \wedge C_2 \wedge \cdots \wedge C_M \tag{9.37} \]


gives ‘true’. If an assignment of each $x_i$ to ‘true’ or ‘false’ exists so that F is
‘true’, then this logical expression is satisfiable. Otherwise, it is unsatisfiable.

For instance, in the case of N = 3, M = 3, and K = 2, we may form the
following clauses:


\[ C_1 = x_1 \vee \bar{x}_2, \quad C_2 = x_1 \vee x_3, \quad C_3 = \bar{x}_2 \vee x_3, \tag{9.38} \]


and $F = C_1 \wedge C_2 \wedge C_3$. This F is satisfied by $x_1$ = ‘true’, $x_2$ = ‘false’, and $x_3$ =
‘true’.
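The small example can be checked by exhaustive enumeration. The following Python sketch is only an illustration: the clause encoding, the function names, and the reading of the three clauses as $C_1 = x_1 \vee \bar{x}_2$, $C_2 = x_1 \vee x_3$, $C_3 = \bar{x}_2 \vee x_3$ are assumptions made here, and brute force is feasible only for very small N.

```python
from itertools import product

# Each clause is a list of literals (variable index, negated?).
# Assumed reading of the example: C1 = x1 OR NOT x2, C2 = x1 OR x3, C3 = NOT x2 OR x3.
CLAUSES = [
    [(0, False), (1, True)],
    [(0, False), (2, False)],
    [(1, True), (2, False)],
]


def satisfies(assignment, clauses):
    """Return True if every clause contains at least one true literal."""
    # A literal (i, neg) is true when assignment[i] differs from neg.
    return all(any(assignment[i] != neg for i, neg in clause) for clause in clauses)


def solutions(n_vars, clauses):
    """Enumerate all satisfying assignments by brute force (2**n_vars candidates)."""
    return [a for a in product([False, True], repeat=n_vars) if satisfies(a, clauses)]
```

Calling `satisfies((True, False, True), CLAUSES)` tests the assignment quoted in the text, and `solutions(3, CLAUSES)` lists every satisfying assignment of this toy formula.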

It is expected that the problem is difficult (likely to be unsatisfiable) if M
is large since the number of conditions in (9.37) is large. The other limit of
large N should be easy (likely to be satisfiable) because the number of possible
choices of $x_{i_1}, \ldots, x_{i_K}$ out of $x_1, \ldots, x_N$ is large and hence $C_l$ is less likely to
share the same $x_i$ with other clauses. It can indeed be shown that the problem
undergoes a phase transition between easy and difficult regions as the ratio
α = M/N crosses a critical value $\alpha_c$ in the limit N, M → ∞ with α and K fixed.
Evidence for this behaviour comes from numerical experiments (Kirkpatrick and
Selman 1994; Hogg et al. 1996), rigorous arguments (Goerdt 1996), and replica
calculations (below): for α < $\alpha_c$ (easy region), one finds an exponentially large
number of solutions satisfying a K-SAT instance, that is, a finite entropy. The number
of solutions vanishes above $\alpha_c$ (difficult region) and the best one can do is to
minimize the number of unsatisfied clauses (MAX K-SAT). The critical values are
$\alpha_c$ = 1 (exact) for K = 2, 4.17 for K = 3 (numerical), and $2^K \log 2$ (asymptotic) for
sufficiently large K. The order parameter changes continuously at $\alpha_c$ for K = 2
but discontinuously for K ≥ 3. The RS solution gives the correct answer for
α < $\alpha_c$ and one should consider RSB when α > $\alpha_c$.

It is also known that the K-SAT is NP complete for K ≥ 3 (Garey and
Johnson 1979) in the sense that no generic polynomial-time algorithm is known
to find a solution when we know that a solution exists. By contrast, a linear-time
algorithm exists to find a solution for K = 1 and 2 (Aspvall et al. 1979). The
qualitatively different behaviour of K = 2 and K ≥ 3 mentioned above may be
related to this fact.


9.5.2 Statistical-mechanical formulation 


Let us follow Monasson and Zecchina (1997, 1998) and Monasson et al. (2000)
and formulate the random K-SAT. The energy to be minimized in the K-SAT
is the number of unsatisfied clauses. We introduce an Ising variable $S_i$ which
is 1 if the logical variable $x_i$ is ‘true’ and −1 if $x_i$ is ‘false’. The variable with
quenched randomness $C_{li}$ is equal to 1 if the clause $C_l$ includes $x_i$, $C_{li} = -1$
if $\bar{x}_i$ is included, and $C_{li} = 0$ when $x_i$ does not appear in $C_l$. Thus one should
choose K non-vanishing $C_{li}$ from $\{C_{l1}, \ldots, C_{lN}\}$ and assign ±1 randomly to
those non-vanishing components, so that $\sum_{i=1}^{N} C_{li}^2 = K$. Satisfiability is judged by the
value of $\sum_{i=1}^{N} C_{li} S_i$: if it is larger than −K for all l, the problem is satisfiable
because at least one of the $C_{li} S_i$ is 1 (satisfied) in the clause. The energy or the
Hamiltonian is thus formulated using Kronecker’s delta as


\[ E(S) = \sum_{l=1}^{M} \delta\Bigl( \sum_{i=1}^{N} C_{li} S_i, \, -K \Bigr). \tag{9.39} \]
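The energy just defined counts the clauses for which the sum of $C_{li} S_i$ equals −K, that is, the clauses in which every literal is violated. A minimal Python sketch of this counting follows; the function name `energy` and the toy coupling matrix `COUPLINGS` (which encodes the clauses $x_1 \vee \bar{x}_2$, $x_1 \vee x_3$, $\bar{x}_2 \vee x_3$) are illustrative assumptions, not part of the original formulation.

```python
def energy(spins, couplings):
    """Number of unsatisfied clauses: sum over l of delta(sum_i C_li * S_i, -K).

    spins     -- list of +1/-1 values, S_i = +1 for 'true' and -1 for 'false'
    couplings -- one row per clause with entries +1 (x_i), -1 (NOT x_i), 0 (absent)
    """
    unsatisfied = 0
    for row in couplings:
        k = sum(1 for c in row if c != 0)           # number of literals in the clause
        if sum(c * s for c, s in zip(row, spins)) == -k:
            unsatisfied += 1                        # every literal in the clause is false
    return unsatisfied


# Hypothetical toy instance with N = 3, M = 3, K = 2.
COUPLINGS = [
    [1, -1, 0],   # x1 OR NOT x2
    [1, 0, 1],    # x1 OR x3
    [0, -1, 1],   # NOT x2 OR x3
]
```

With this encoding, `energy([1, -1, 1], COUPLINGS)` evaluates the assignment 'true', 'false', 'true', and flipping all three spins violates every clause.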
Vanishing energy means that the problem is satisfiable. Macroscopic properties
of the problem such as the expectation values of the energy and entropy can
be calculated from this Hamiltonian by statistical-mechanical techniques. One
can observe in (9.39) that Kronecker’s delta imposes a relation between K spins
and thus represents an interaction between K spins. The qualitative difference
between K = 2 and K ≥ 3 mentioned in the previous subsection may be related
to this fact; in the infinite-range r-spin interacting model discussed in Chapter
5, the transition is of second order for r = 2 but is of first order for r ≥ 3.
Averaging over quenched randomness in $C_{li}$ can be performed by the replica
method. Since the M clauses are independent of each other, we find


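\[ [Z^n] = \mathrm{Tr} \, \zeta_K(S)^M \tag{9.40} \]

\[ \zeta_K(S) = \Bigl[ \exp\Bigl\{ -\frac{1}{T} \sum_{\alpha=1}^{n} \delta\Bigl( \sum_{i=1}^{N} C_i S_i^\alpha, \, -K \Bigr) \Bigr\} \Bigr], \tag{9.41} \]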


where Tr denotes the sum over S. The configurational average [···] is taken over
the random choice of the $C_i$. We have introduced the temperature T to control
the average energy. It is useful to note in (9.41) that


\[ \delta\Bigl( \sum_{i=1}^{N} C_i S_i^\alpha, \, -K \Bigr) = \prod_{i=1;\, C_i \ne 0}^{N} \delta(S_i^\alpha, -C_i), \tag{9.42} \]


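where the product runs over all the i for which $C_i$ is not vanishing. Then $\zeta_K(S)$
of (9.41) is


\[ \zeta_K(S) = \frac{1}{2^K} \sum_{C_1, \ldots, C_K = \pm 1} \frac{1}{N^K} \sum_{i_1, \ldots, i_K = 1}^{N} \exp\Bigl\{ -\frac{1}{T} \sum_{\alpha=1}^{n} \prod_{k=1}^{K} \delta(S_{i_k}^\alpha, -C_k) \Bigr\}, \tag{9.43} \]


where we have neglected corrections of $O(N^{-1})$. This formula for $\zeta_K(S)$ is
conveniently rewritten in terms of $\{c(\sigma)\}$, the set of the number of sites with a
specified spin pattern in the replica space $\sigma = \{\sigma^1, \ldots, \sigma^n\}$: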


\[ N c(\sigma) = \sum_{i=1}^{N} \prod_{\alpha=1}^{n} \delta(S_i^\alpha, \sigma^\alpha). \tag{9.44} \]


Equation (9.43) depends on the spin configuration S only through $\{c(\sigma)\}$. Indeed,
if we choose $\sigma_k^\alpha = -C_k S_{i_k}^\alpha$, Kronecker’s delta in the exponent is $\delta(\sigma_k^\alpha, 1)$.
All such terms (there are $N c(-C_k \sigma_k)$ of them) in the sum over $i_1$ to $i_K$ give the
same contribution. Hence (9.43) is written as


\[ \zeta_K(S) = \zeta_K(\{c\}) \equiv \frac{1}{2^K} \sum_{C_1, \ldots, C_K = \pm 1} \sum_{\sigma_1, \ldots, \sigma_K} c(-C_1 \sigma_1) \cdots c(-C_K \sigma_K) \exp\Bigl\{ -\frac{1}{T} \sum_{\alpha=1}^{n} \prod_{k=1}^{K} \delta(\sigma_k^\alpha, 1) \Bigr\}. \tag{9.45} \]


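We may drop the $C_k$-dependence of $c(-C_k \sigma_k)$ in (9.45) due to the relation $c(\sigma) =
c(-\sigma)$. This last equation is equivalent to the assumption that the overlap of odd
numbers of replica spins vanishes. To see this, we expand the right hand side of
(9.44) as


\[ c(\sigma) = \frac{1}{N} \sum_i \prod_\alpha \frac{1 + S_i^\alpha \sigma^\alpha}{2} = \frac{1}{2^n} \Bigl( 1 + \sum_\alpha Q^\alpha \sigma^\alpha + \sum_{\alpha<\beta} Q^{\alpha\beta} \sigma^\alpha \sigma^\beta + \sum_{\alpha<\beta<\gamma} Q^{\alpha\beta\gamma} \sigma^\alpha \sigma^\beta \sigma^\gamma + \cdots \Bigr), \tag{9.46} \]


where


\[ Q^{\alpha\beta\gamma\cdots} = \frac{1}{N} \sum_i S_i^\alpha S_i^\beta S_i^\gamma \cdots . \tag{9.47} \]


The symmetry $c(\sigma) = c(-\sigma)$ follows from the relation $Q^\alpha = Q^{\alpha\beta\gamma} = \cdots = 0$ for
odd numbers of replica indices, which is natural if there is no symmetry breaking
of the ferromagnetic type. We assume this to be the case here.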

The partition function (9.41) now has a compact form 


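\[ [Z^n] = \int \prod_\sigma dc(\sigma) \, e^{-N E_0(\{c\})} \, \mathrm{Tr} \prod_\sigma \delta\Bigl( c(\sigma) - N^{-1} \sum_{i=1}^{N} \prod_{\alpha=1}^{n} \delta(S_i^\alpha, \sigma^\alpha) \Bigr) \tag{9.48} \]

\[ E_0(\{c\}) = -\alpha \log \Bigl\{ \sum_{\sigma_1, \ldots, \sigma_K} c(\sigma_1) \cdots c(\sigma_K) \prod_{\alpha=1}^{n} \Bigl( 1 + (e^{-\beta} - 1) \prod_{k=1}^{K} \delta(\sigma_k^\alpha, 1) \Bigr) \Bigr\}, \tag{9.49} \]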


where one should not confuse α = M/N in front of the logarithm with the
replica index. The trace operation over the spin variables S (after the Tr symbol
in (9.48)) gives the entropy as

\[ \frac{N!}{\prod_\sigma (N c(\sigma))!} = \exp\Bigl( -N \sum_\sigma c(\sigma) \log c(\sigma) \Bigr) \tag{9.50} \]

by Stirling’s formula. If we apply the steepest descent method to the integral in
(9.48), the free energy is given as, in the thermodynamic limit N, M → ∞ with
α fixed,

\[ -\frac{\beta F}{N} = -E_0(\{c\}) - \sum_\sigma c(\sigma) \log c(\sigma) \tag{9.51} \]

with the condition $\sum_\sigma c(\sigma) = 1$.


9.5.3 Replica-symmetric solution and its interpretation 


The free energy (9.51) is to be extremized with respect to $c(\sigma)$. The simple RS
solution amounts to assuming $c(\sigma)$ which is symmetric under permutation of
$\sigma^1, \ldots, \sigma^n$. It is convenient to express the function $c(\sigma)$ in terms of the
distribution function of local magnetization P(m) as


\[ c(\sigma) = \int_{-1}^{1} dm \, P(m) \prod_{\alpha=1}^{n} \frac{1 + m\sigma^\alpha}{2}. \tag{9.52} \]


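It is clear that this $c(\sigma)$ is RS. The extremization condition of the free energy
(9.51) leads to the following self-consistent equation of P(m) as shown in
Appendix D:


\[ P(m) = \frac{1}{2\pi (1 - m^2)} \int_{-\infty}^{\infty} du \, \cos\Bigl( \frac{u}{2} \log \frac{1+m}{1-m} \Bigr) \exp\Bigl\{ -\alpha K + \alpha K \int_{-1}^{1} \prod_{k=1}^{K-1} dm_k P(m_k) \cos\Bigl( \frac{u}{2} \log A_{K-1} \Bigr) \Bigr\} \tag{9.53} \]

\[ A_{K-1} = 1 + (e^{-\beta} - 1) \prod_{k=1}^{K-1} \frac{1 + m_k}{2}. \tag{9.54} \]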


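The free energy is written in terms of the solution of the above equation as


\[ -\frac{\beta F}{N} = \log 2 + \alpha (1 - K) \int_{-1}^{1} \prod_{k=1}^{K} dm_k P(m_k) \log A_K + \frac{\alpha K}{2} \int_{-1}^{1} \prod_{k=1}^{K-1} dm_k P(m_k) \log A_{K-1} - \frac{1}{2} \int_{-1}^{1} dm \, P(m) \log(1 - m^2). \tag{9.55} \]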


It is instructive to investigate the simplest case of K = 1 using the above
result. When K = 1, $A_0$ is $e^{-\beta}$ and (9.53) gives


\[ P(m) = \frac{1}{2\pi (1 - m^2)} \int_{-\infty}^{\infty} du \, \cos\Bigl( \frac{u}{2} \log \frac{1+m}{1-m} \Bigr) e^{-\alpha + \alpha \cos(u\beta/2)}. \tag{9.56} \]

Expressing the exponential cosine in terms of modified Bessel functions (using
$e^{z \cos\theta} = \sum_k I_k(z) e^{ik\theta}$) and integrating the result over u, we find


\[ P(m) = e^{-\alpha} \sum_{k=-\infty}^{\infty} I_k(\alpha) \, \delta\Bigl( m - \tanh \frac{\beta k}{2} \Bigr). \tag{9.57} \]
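In the interesting case of the zero-temperature limit, this equation reduces to

\[ P(m) = e^{-\alpha} I_0(\alpha) \delta(m) + \frac{1}{2} \bigl( 1 - e^{-\alpha} I_0(\alpha) \bigr) \{ \delta(m - 1) + \delta(m + 1) \}. \tag{9.58} \]

Inserting the distribution function (9.57) into (9.55), we obtain the free energy


\[ -\frac{\beta F}{N} = \log 2 - \frac{\alpha\beta}{2} + e^{-\alpha} \sum_{k=-\infty}^{\infty} I_k(\alpha) \log \cosh \frac{\beta k}{2}, \tag{9.59} \]


which gives, in the limit β → ∞,


\[ \frac{E(\alpha)}{N} = \frac{\alpha}{2} \bigl\{ 1 - e^{-\alpha} ( I_0(\alpha) + I_1(\alpha) ) \bigr\}. \tag{9.60} \]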
This ground-state energy is positive for all positive α. It means that the K = 1
SAT is always unsatisfiable for α > 0. The positive weight of δ(m ± 1) in (9.58)
is the origin of this behaviour: since a spin is fixed to 1 (or −1) with finite
probability $(1 - e^{-\alpha} I_0(\alpha))/2$, the addition of a clause to the already existing M
clauses gives a finite probability of yielding a ‘false’ formula because one may
choose a spin fixed to the wrong value as the (M + 1)th clause. If P(m) were to
consist only of a delta function at the origin, we might be able to adjust the spin
chosen as the (M + 1)th clause to give the value ‘true’ for the whole formula;
this is not the case, however.
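The K = 1 formulas can be checked numerically. The Python sketch below assumes the readings $P(m) = e^{-\alpha} \sum_k I_k(\alpha)\, \delta(m - \tanh(\beta k/2))$ for (9.57) and $E/N = (\alpha/2)\{1 - e^{-\alpha}(I_0(\alpha) + I_1(\alpha))\}$ for (9.60); the function names and the truncation of the Bessel series at 40 terms are choices made for this illustration and are adequate only for moderate α.

```python
import math


def bessel_i(k, x, terms=40):
    """Modified Bessel function I_k(x), integer k >= 0, from its power series."""
    return sum((x / 2) ** (2 * m + k) / (math.factorial(m) * math.factorial(m + k))
               for m in range(terms))


def total_weight(alpha, kmax=40):
    """Sum of all delta-peak weights, e^{-alpha} * sum_k I_k(alpha); should equal 1."""
    tail = sum(bessel_i(k, alpha) for k in range(1, kmax))
    return math.exp(-alpha) * (bessel_i(0, alpha) + 2 * tail)   # I_{-k} = I_k


def ground_state_energy(alpha):
    """Closed form E/N = (alpha/2) * (1 - e^{-alpha} * (I_0 + I_1))."""
    return 0.5 * alpha * (1 - math.exp(-alpha) * (bessel_i(0, alpha) + bessel_i(1, alpha)))


def energy_from_peaks(alpha, kmax=40):
    """Same energy from log cosh(beta*k/2) -> beta*|k|/2: alpha/2 - e^{-alpha} sum_{k>=1} k I_k."""
    return 0.5 * alpha - math.exp(-alpha) * sum(k * bessel_i(k, alpha) for k in range(1, kmax))
```

The two energy functions agree because of the recurrence $k I_k(\alpha) = (\alpha/2)(I_{k-1}(\alpha) - I_{k+1}(\alpha))$, which makes the sum over k telescope.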

The coefficient $e^{-\alpha} I_0(\alpha)$ of δ(m) in (9.58) is the probability that a spin is free
to flip in the ground state. Since a single free spin has the entropy log 2, the total
entropy might seem to be $N e^{-\alpha} I_0(\alpha) \log 2$, which can also be derived directly
from (9.59). A more careful inspection suggests subtraction of $N e^{-\alpha} \log 2$ from
the above expression to yield the correct ground-state entropy


\[ \frac{S}{N} = e^{-\alpha} ( I_0(\alpha) - 1 ) \log 2. \tag{9.61} \]

The reason for the subtraction of $N e^{-\alpha} \log 2$ is as follows (Sasamoto, private
communication).

Let us consider the simplest case of M = 1, K = 1, and N arbitrary. Since
only a single spin is chosen as the clause, the values of all the other spins do
not affect the energy. In this sense the ground-state degeneracy is $2^{N-1}$ and the
corresponding entropy is (N − 1) log 2. However, such a redundancy is clearly a
trivial one, and we should count only the degree(s) of freedom found in the spins
which actually exist in the clause, disregarding those of the redundant spins not
chosen in the clauses. Accordingly, the real degeneracy of the above example is
unity and the entropy is zero instead of (N − 1) log 2.

For general M (and K = 1), the probability of a spin not to be chosen in a
clause is 1 − 1/N, so that the probability that a spin is not chosen in any of the M
clauses is $(1 - 1/N)^M$, which reduces to $e^{-\alpha}$ as M, N → ∞ with α = M/N fixed.
Thus the contribution from the redundant spins to the entropy is $N e^{-\alpha} \log 2$,
which is to be subtracted as in (9.61). A little more careful argument on the
probability for a spin not to be chosen gives the same answer.

The entropy (9.61) is a non-monotonic function of α, starting from zero at
α = 0 and vanishing again in the limit α → ∞. Thus it is positive for any positive
α, implying a macroscopic degeneracy of the ground state, MAX 1-SAT.
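Both statements can be illustrated numerically. The sketch below assumes the reading $S/N = e^{-\alpha}(I_0(\alpha) - 1) \log 2$ of (9.61) and checks that $(1 - 1/N)^M$ approaches $e^{-\alpha}$ for large N; the function names and the 40-term series truncation are illustrative choices, not part of the original argument.

```python
import math


def bessel_i0(x, terms=40):
    """Modified Bessel function I_0(x) from its power series."""
    return sum((x / 2) ** (2 * m) / math.factorial(m) ** 2 for m in range(terms))


def ground_state_entropy(alpha):
    """Entropy per spin, e^{-alpha} * (I_0(alpha) - 1) * log 2."""
    return math.exp(-alpha) * (bessel_i0(alpha) - 1.0) * math.log(2)


def unchosen_probability(n_spins, alpha):
    """Probability (1 - 1/N)^M that a given spin appears in none of M = alpha*N clauses."""
    n_clauses = round(alpha * n_spins)
    return (1.0 - 1.0 / n_spins) ** n_clauses
```

Evaluating `ground_state_entropy` on a grid of α shows the rise from zero and the slow decay at large α, while `unchosen_probability(10**6, 1.0)` is already very close to $e^{-1}$.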

Analysis of the case K ≥ 2 is much more difficult and the solutions are known
only partially. We summarize the results below and refer the interested reader to
Monasson and Zecchina (1997, 1998) and Monasson et al. (2000). If P(m) does
not have delta peaks at m = ±1, the spin states are flexible, and the condition
$F = C_1 \wedge \cdots \wedge C_M$ = ‘true’ can be satisfied. This is indeed the case for small α
(easy region). The ground-state energy vanishes in this region, and the problem is
satisfiable. There are an exponentially large number of solutions, a finite entropy.
When α exceeds a threshold $\alpha_c$, P(m) starts to have delta peaks at m = ±1
continuously (K = 2) or discontinuously (K ≥ 3) across $\alpha_c$. The delta peaks at
m = ±1 imply that a finite fraction of spins are completely frozen so that it is
difficult to satisfy the condition F = ‘true’; there is a finite probability that all
the $x_i$ in a clause are frozen to the wrong values. The K-SAT is unsatisfiable
in this region α > $\alpha_c$. The critical point is $\alpha_c$ = 1 for K = 2 and $\alpha_c$ = 4.17
for the discontinuous case of K = 3. The latter value of 4.17 is from numerical
simulations. It is hard to locate the transition point $\alpha_c$ from the RS analysis for
K ≥ 3 because one should compare the RS and RSB free energies, the latter
taking place for α > $\alpha_c$. Comparison with numerical simulations indicates that,
for any K ≥ 2, the RS theory is correct for the easy region α < $\alpha_c$ but not for
the difficult region α > $\alpha_c$. In the former region an extensive number of solutions
exist as mentioned above, but they suddenly disappear at $\alpha_c$.


9.6 Simulated annealing 


Let us next discuss the convergence problem of simulated annealing. Simulated 
annealing is widely used as a generic numerical method to solve combinatorial 
optimization problems. This section is not about a direct application of the spin 
glass theory but is nonetheless closely related to it; the ground-state search of 
a spin glass system is a very interesting example of combinatorial optimization, 
and simulated annealing emerged historically through efforts to identify the cor- 
rect ground state in the complex energy landscape found typically in spin glasses 
(Kirkpatrick et al. 1983). We investigate, in particular, some mathematical
aspects of simulated annealing with the generalized transition probability, which
has recently been attracting attention for its fast convergence properties and
various other reasons (Abe and Okamoto 2001).

Fig. 9.4. Phase space with a simple structure (a) and a complicated structure (b)


9.6.1 Simulated annealing 


Suppose that we wish to find the optimal state (that minimizes the cost function) 
by starting from a random initial state and changing the state gradually. If the 
value of the cost function decreases by a small change of state, we accept the new 
state as the one that is actually realized at the next step, and we reject the new 
state if the cost function increases. We generate new states consecutively by this 
process until no new states are actually accepted, in which case we understand 
that the optimal state has been reached (Fig. 9.4). This idea is called the gradient 
descent method. If the phase space has a simple structure as in Fig. 9.4(a), 
the gradient descent always leads to the optimal state. However, this is not 
necessarily the case if there are local minima which are not true minima as in 
Fig. 9.4(b), because the system would be trapped in a local minimum for some 
initial conditions. 

It is then useful to introduce transitions induced by thermal fluctuations
since they allow processes to increase the value of the cost function with a
certain probability. We thus introduce the concept of temperature T as an externally
controllable parameter. If the cost function decreases by a small change of state,
then we accept the new state just as in the simple gradient descent method. If, on
the other hand, the cost function increases, we accept the new state with
probability $e^{-\Delta f/T}$ that is determined by the increase of the cost function Δf (> 0)
and the temperature T. In the initial stage of simulated annealing, we keep the
temperature high, which stimulates transitions to increase the cost function with
relatively high probability because $e^{-\Delta f/T}$ is close to unity. The system searches
the global structure of the phase space by such processes that allow the system to
stay around states with relatively high values of the cost function. Then, a gradual
decrease of the temperature forces the system to have larger probabilities to
stay near the optimal state with low f, which implies that more and more local
structures are taken into account. We finally let T → 0 to stop state changes
and, if successful, the optimal state will be reached. Simulated annealing is the
idea to realize the above process numerically to obtain an approximate solution
of a combinatorial optimization problem. One can clearly reach the true optimal
state if the temperature is lowered infinitesimally slowly. In practical numerical
calculations one decreases the temperature at a finite speed and terminates
the process before the temperature becomes exactly zero if a certain criterion
is satisfied. Simulated annealing is an approximate numerical method for this
reason.
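The procedure described above can be sketched in a few lines of Python. Everything specific in this sketch is an assumption made for illustration: the function signature, the five-state toy landscape `LANDSCAPE`, the nearest-neighbour move set, and the particular schedules passed in by the caller. Downhill moves are always accepted and uphill moves are accepted with probability $e^{-\Delta f/T}$.

```python
import math
import random


def simulated_annealing(cost, neighbours, start, schedule, steps, rng):
    """Anneal for a fixed number of steps and return the lowest-cost state visited."""
    state = best = start
    for t in range(steps):
        temperature = schedule(t)                 # externally controlled T(t) > 0
        candidate = rng.choice(neighbours(state))
        df = cost(candidate) - cost(state)
        # Accept downhill moves always, uphill moves with probability exp(-df/T).
        if df <= 0 or rng.random() < math.exp(-df / temperature):
            state = candidate
        if cost(state) < cost(best):
            best = state
    return best


# Hypothetical toy landscape: local minimum at index 1, global minimum at index 3.
LANDSCAPE = [2, 1, 2, 0, 3]


def toy_cost(i):
    return LANDSCAPE[i]


def toy_neighbours(i):
    """States reachable in one step: the adjacent indices inside the landscape."""
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]
```

Started from the local minimum at index 1 with a temperature held near zero, the walk never leaves that state, which is the trapping described for plain gradient descent; with a slowly decreasing schedule such as `lambda t: 5.0 / math.log(t + 2)` it can cross the barrier and find index 3.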


9.6.2 Annealing schedule and generalized transition probability 


An important practical issue in simulated annealing is the annealing schedule, the
rate of temperature decrease. If the temperature were decreased too rapidly, the
system would be trapped in a local minimum and lose the chance to escape from it
because a quick decrease of the temperature soon inhibits processes that increase
the cost function. If the temperature is changed sufficiently slowly, on the other
hand, the system may be regarded as being approximately in an equilibrium state at
each T and the system therefore reaches the true optimal state in the limit T → 0.
It is, however, impracticable to decrease the temperature infinitesimally slowly.
Thus the problem of the annealing schedule arises, in which we ask ourselves how
fast we can decrease the temperature without being trapped in local (not global)
minima.

Fortunately, this problem has already been solved in the following sense
(Geman and Geman 1984; Aarts and Korst 1989). When we allow the system to
increase the cost function with probability $e^{-\Delta f/T}$ as mentioned before, the
system reaches the true optimal state in the infinite-time limit t → ∞ as long as the
temperature is decreased so that it satisfies the inequality T(t) ≥ c/log(t + 2).
Here c is a constant of the order of the system size N. It should, however, be
noted that the logarithm log(t + 2) is only mildly dependent on t and the lower
bound of the temperature does not approach zero very quickly. Thus the above
bound is not practically useful although it is theoretically important.
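To see how slow the logarithmic bound is, one can invert T(t) = c/log(t + 2) for the number of steps needed to reach a given temperature. The helper below is a hypothetical illustration; the name `steps_to_cool` and the sample values of c are not taken from the text.

```python
import math


def steps_to_cool(c, target_temperature):
    """Smallest real t with c / log(t + 2) <= target_temperature, i.e. exp(c/T) - 2.

    Overflows for c / target_temperature above roughly 709 (double precision).
    """
    return math.exp(c / target_temperature) - 2.0
```

For c of the order of a modest system size, say c = 100, cooling to T = 1 along the bound already requires about $e^{100}$ steps, which is why the schedule is of theoretical rather than practical value.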

An inspection of the proof of the result mentioned above reveals that this
logarithmic dependence on time t has its origin in the exponential form of the
transition probability $e^{-\Delta f/T}$. It further turns out that this exponential func-


equilibrium state at temperature T. 

However, it is not necessary to use the exponential transition probability,
which comes from the equilibrium distribution function, because we are
interested only in the limit T → 0 in combinatorial optimization problems. The
only requirement is to reach the optimal state in the end, regardless of
intermediate steps. Indeed, numerical investigations have shown that the generalized
transition probability to be explained below has the property of very rapid
convergence to the optimal state and is now used actively in many situations (Abe
and Okamoto 2001). We show in the following that simulated annealing using
the generalized transition probability converges in the sense of weak ergodicity
under an appropriate annealing schedule (Nishimori and Inoue 1998). The proof
given here includes the convergence proof of conventional simulated annealing
with the exponential transition probability as a special case.


9.6.3. Inhomogeneous Markov chain 


Suppose that we generate states one after another sequentially by a stochastic
process starting from an initial state. We consider a Markov process in which the
next state is determined only by the present state, so that we call it a Markov
chain. Our goal in the present section is to investigate conditions of convergence
of the Markov chain generated by the generalized transition probability explained
below. We first list various definitions and notations.

The cost function is denoted by f, which is defined on the set of states (phase
space) S. The temperature T is a function of time, and accordingly the transition
probability G from a state x (∈ S) to y is also a function of time t (= 0, 1, 2, ...).
This defines an inhomogeneous Markov chain in which the transition probability
depends on time. This G is written as follows:


\[ G(x, y; t) = \begin{cases} P(x, y) A(x, y; T(t)) & (x \ne y) \\ 1 - \sum_{z (\ne x)} P(x, z) A(x, z; T(t)) & (x = y). \end{cases} \tag{9.62} \]

Here P(x, y) is the generation probability (probability to generate a new state)


\[ P(x, y) \begin{cases} > 0 & (y \in S_x) \\ = 0 & (\text{otherwise}), \end{cases} \tag{9.63} \]


where $S_x$ is the neighbour of x, the set of states that can be reached by a single
step from x. In (9.62), A(x, y; T) is the acceptance probability (probability by
which the system actually makes a transition to the new state) with the form


\[ A(x, y; T) = \min\{ 1, u(x, y; T) \}, \qquad u(x, y; T) = \Bigl( 1 + (q - 1) \frac{f(y) - f(x)}{T} \Bigr)^{1/(1-q)}. \tag{9.64} \]


Here q is a real parameter and we assume q > 1 for the moment. Equation (9.62)
shows that one generates a new trial state y with probability P(x, y), and the
system actually makes a transition to it with probability A(x, y; T). According
to (9.64), when the change of the cost function Δf = f(y) − f(x) is zero or
negative, we have u ≥ 1 and thus A(x, y; T) = 1 and the state certainly changes
to the new one. If Δf is positive, on the other hand, u < 1 and the transition to y
is determined with probability u(x, y; T). If Δf < 0 and the quantity in the large
parentheses in (9.64) vanishes or is negative, then we understand that u → ∞
or A = 1. It is assumed that the generation probability P(x, y) is irreducible: it
is possible to move from an arbitrary state in S to another arbitrary state by
successive transitions between pairs of states x and y satisfying P(x, y) > 0. It
should be remarked here that the acceptance probability (9.64) reduces to the
previously mentioned exponential form $e^{-\Delta f/T}$ in the limit q → 1. Equation

Now we choose the annealing schedule as follows:


\[ T(t) = \frac{b}{(t + 2)^c} \qquad (b, c > 0, \; t = 0, 1, 2, \ldots). \tag{9.65} \]

In the convergence proof of conventional simulated annealing corresponding to
q → 1, a logarithm of t appears in the denominator of (9.65). Here in the case
of the generalized transition probability (9.62)–(9.64), it will turn out to be
appropriate to decrease T as a power of t.
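The generalized acceptance probability and a power-law schedule are easy to write down explicitly. The Python sketch below assumes $u = (1 + (q-1)\Delta f/T)^{1/(1-q)}$ for (9.64) and $T(t) = b/(t+2)^c$ for (9.65); the function names are illustrative, and returning u = 0 when the base is non-positive for Δf > 0 follows the numerical convention that the text mentions later for q < 1.

```python
import math


def generalized_u(df, temperature, q):
    """u = (1 + (q - 1) * df / T) ** (1 / (1 - q)); reduces to exp(-df / T) as q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(-df / temperature)
    base = 1.0 + (q - 1.0) * df / temperature
    if base <= 0.0:
        # df < 0: read as u -> infinity (move always accepted).
        # df > 0 (possible only for q < 1): treated as u = 0, no transition.
        return math.inf if df < 0 else 0.0
    return base ** (1.0 / (1.0 - q))


def acceptance(df, temperature, q):
    """Acceptance probability A = min{1, u}."""
    return min(1.0, generalized_u(df, temperature, q))


def power_schedule(t, b, c):
    """Power-law annealing schedule T(t) = b / (t + 2) ** c with b, c > 0."""
    return b / (t + 2) ** c
```

For q > 1 the function u decays only as a power of Δf/T, so large uphill moves remain far more likely than under the exponential rule; for example `generalized_u(10, 1, 2)` is 1/11, whereas `math.exp(-10)` is below $10^{-4}$.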

It is convenient to regard G(x, y; t) as a matrix element of a transition matrix
G(t):

\[ [G(t)]_{x,y} = G(x, y; t). \tag{9.66} \]


We denote the set of probability distributions on S by P and regard a probability
distribution p as a row vector with element p(x) (x ∈ S). If at time s the system
is in the state described by the probability distribution $p_0$ (∈ P), the probability
distribution at time t is given as


\[ p(s, t) = p_0 G^{s,t} \equiv p_0 G(s) G(s+1) \cdots G(t-1). \tag{9.67} \]


We also introduce the coefficient of ergodicity as follows, which is a measure of
the state change in a single step:


\[ \alpha(G^{s,t}) = 1 - \min\Bigl\{ \sum_{z \in S} \min\{ G^{s,t}(x, z), G^{s,t}(y, z) \} \Bigm| x, y \in S \Bigr\}. \tag{9.68} \]
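For a finite state space the coefficient of ergodicity is straightforward to compute from a row-stochastic matrix: for every pair of rows one sums the element-wise minima and takes the smallest such overlap. The following Python sketch, with a hypothetical function name and plain nested lists as the matrix representation, illustrates the definition.

```python
def ergodic_coefficient(matrix):
    """alpha(G) = 1 - min over (x, y) of sum_z min(G[x][z], G[y][z]).

    matrix -- square row-stochastic matrix given as a list of rows
    """
    n = len(matrix)
    smallest_overlap = min(
        sum(min(matrix[x][z], matrix[y][z]) for z in range(n))
        for x in range(n)
        for y in range(n)
    )
    return 1.0 - smallest_overlap
```

The coefficient is 1 for the identity matrix, where different initial states never mix, and 0 for a matrix with identical rows, where a single step erases all memory of the initial state.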


In the next section we prove weak ergodicity of the system, which means that
the probability distribution is asymptotically independent of the initial condition:


\[ \forall s \ge 0 : \lim_{t \to \infty} \sup\{ \| p_1(s, t) - p_2(s, t) \| \mid p_{01}, p_{02} \in P \} = 0, \tag{9.69} \]


where $p_1(s, t)$ and $p_2(s, t)$ are distribution functions with different initial
conditions,


\[ p_1(s, t) = p_{01} G^{s,t}, \qquad p_2(s, t) = p_{02} G^{s,t}. \tag{9.70} \]

The norm of the difference of probability distributions in (9.69) is defined by

\[ \| p_1 - p_2 \| = \sum_{x \in S} | p_1(x) - p_2(x) |. \tag{9.71} \]

Strong ergodicity, to be contrasted with weak ergodicity, means that the
probability distribution approaches a fixed distribution irrespective of the initial
condition:

\[ \exists r \in P, \forall s \ge 0 : \lim_{t \to \infty} \sup\{ \| p(s, t) - r \| \mid p_0 \in P \} = 0. \tag{9.72} \]

The conditions for ergodicity are summarized in the following theorems (Aarts
and Korst 1989).

Theorem 9.1. (Weak ergodicity) An inhomogeneous Markov chain is weakly
ergodic if and only if there exists a monotonically increasing sequence of integers


\[ 0 < t_0 < t_1 < \cdots < t_i < t_{i+1} < \cdots \]


and the coefficient of ergodicity satisfies


\[ \sum_{i=0}^{\infty} \bigl( 1 - \alpha(G^{t_i, t_{i+1}}) \bigr) = \infty. \tag{9.73} \]

Theorem 9.2. (Strong ergodicity) An inhomogeneous Markov chain is strongly
ergodic if
• it is weakly ergodic,
• there exists a stationary state $p_t = p_t G(t)$ at each t, and
• the above $p_t$ satisfies the condition


\[ \sum_{t=0}^{\infty} \| p_t - p_{t+1} \| < \infty. \tag{9.74} \]


9.6.4 Weak ergodicity 


We prove in this section that the Markov chain generated by the generalized
transition probability (9.62)–(9.64) is weakly ergodic. The following lemma is
useful for this purpose.

Lemma 9.3. (Lower bound to the transition probability) The transition
probability of the inhomogeneous Markov chain defined in §9.6.3 satisfies the
following inequality. Off-diagonal elements of G for transitions between different
states satisfy


\[ P(x, y) > 0 \Rightarrow \forall t \ge 0 : G(x, y; t) \ge w \Bigl( 1 + \frac{(q-1)L}{T(t)} \Bigr)^{1/(1-q)}. \tag{9.75} \]


For diagonal elements we have


\[ \forall x \in S \setminus S^M, \exists t_1 > 0, \forall t \ge t_1 : G(x, x; t) \ge w \Bigl( 1 + \frac{(q-1)L}{T(t)} \Bigr)^{1/(1-q)}. \tag{9.76} \]


Here $S^M$ is the set of states to locally maximize the cost function

\[ S^M = \{ x \mid x \in S, \forall y \in S_x : f(y) \le f(x) \}, \tag{9.77} \]

L is the largest value of the change of the cost function by a single step

\[ L = \max\{ |f(x) - f(y)| \mid P(x, y) > 0 \}, \tag{9.78} \]

and w is the minimum value of the non-vanishing generation probability


\[ w = \min\{ P(x, y) \mid P(x, y) > 0, \; x, y \in S \}. \tag{9.79} \]

Proof We first prove (9.75) for off-diagonal elements. If f(y) − f(x) > 0, then
u(x, y; T(t)) < 1 and therefore


\[ G(x, y; t) = P(x, y) A(x, y; T(t)) \ge w \min\{ 1, u(x, y; T(t)) \} = w \, u(x, y; T(t)) \ge w \Bigl( 1 + \frac{(q-1)L}{T(t)} \Bigr)^{1/(1-q)}. \tag{9.80} \]


When f(y) − f(x) ≤ 0, u(x, y; T(t)) ≥ 1 holds, leading to


\[ G(x, y; t) \ge w \min\{ 1, u(x, y; T(t)) \} = w \ge w \Bigl( 1 + \frac{(q-1)L}{T(t)} \Bigr)^{1/(1-q)}. \tag{9.81} \]


To prove the diagonal part (9.76), we note that there exists a state $y_+ \in S_x$
to increase the cost function, $f(y_+) - f(x) > 0$, because of $x \in S \setminus S^M$. Then


\[ \lim_{t \to \infty} u(x, y_+; T(t)) = 0, \tag{9.82} \]

and therefore

\[ \lim_{t \to \infty} \min\{ 1, u(x, y_+; T(t)) \} = 0. \tag{9.83} \]


For sufficiently large t, $\min\{1, u(x, y_+; T(t))\}$ can be made arbitrarily small, and
hence there exists some $t_1$ for an arbitrary ε > 0 which satisfies


\[ \forall t \ge t_1 : \min\{ 1, u(x, y_+; T(t)) \} < \epsilon. \tag{9.84} \]

Therefore

\[ \sum_{z \in S} P(x, z) A(x, z; T(t)) = \sum_{\{y_+\}} P(x, y_+) \min\{ 1, u(x, y_+; T(t)) \} + \sum_{z \in S \setminus \{y_+\}} P(x, z) \min\{ 1, u(x, z; T(t)) \} \]
\[ < \sum_{\{y_+\}} P(x, y_+) \, \epsilon + \sum_{z \in S \setminus \{y_+\}} P(x, z) = -(1 - \epsilon) \sum_{\{y_+\}} P(x, y_+) + 1. \tag{9.85} \]


From this inequality and (9.62) the diagonal element satisfies


\[ G(x, x; t) \ge (1 - \epsilon) \sum_{\{y_+\}} P(x, y_+) \ge w \Bigl( 1 + \frac{(q-1)L}{T(t)} \Bigr)^{1/(1-q)}. \tag{9.86} \]


In the final inequality we have used the fact that the quantity in the large
parentheses can be chosen arbitrarily small for sufficiently large t. □



It is convenient to define some notation to prove weak ergodicity. Let us write
d(x, y) for the minimum number of steps to make a transition from x to y. The
maximum value of d(x, y) as a function of y will be denoted as k(x):

\[ k(x) = \max\{ d(x, y) \mid y \in S \}. \tag{9.87} \]

Thus one can reach an arbitrary state within k(x) steps starting from x. The
minimum value of k(x) for x such that $x \in S \setminus S^M$ is written as R, and the x to
give this minimum value will be $x^*$:


\[ R = \min\{ k(x) \mid x \in S \setminus S^M \}, \qquad x^* = \arg\min\{ k(x) \mid x \in S \setminus S^M \}. \tag{9.88} \]


Theorem 9.4. (Weak ergodicity: generalized transition probability) The
inhomogeneous Markov chain defined in §9.6.3 is weakly ergodic if 0 < c ≤
(q − 1)/R.

Proof Let us consider a transition from x to $x^*$. From (9.67),

\[ G^{t-R,t}(x, x^*) = \sum_{x_1, \ldots, x_{R-1}} G(x, x_1; t-R) \, G(x_1, x_2; t-R+1) \cdots G(x_{R-1}, x^*; t-1). \tag{9.89} \]

There exists a sequence of transitions to reach $x^*$ from x within R steps according
to the definitions of $x^*$ and R,


\[ x \ne x_1 \ne x_2 \ne \cdots \ne x_k = x_{k+1} = \cdots = x_R = x^*. \tag{9.90} \]

We keep this sequence only in the sum (9.89) and use Lemma 9.3 to obtain

\[ G^{t-R,t}(x, x^*) \ge G(x, x_1; t-R) \, G(x_1, x_2; t-R+1) \cdots G(x_{R-1}, x_R; t-1) \]
\[ \ge \prod_{k=1}^{R} w \Bigl( 1 + \frac{(q-1)L}{T(t-R+k-1)} \Bigr)^{1/(1-q)} \ge w^R \Bigl( 1 + \frac{(q-1)L}{T(t-1)} \Bigr)^{R/(1-q)}. \tag{9.91} \]


It therefore follows that the coefficient of ergodicity satisfies the inequality


\[ \alpha(G^{t-R,t}) = 1 - \min\Bigl\{ \sum_{z \in S} \min\{ G^{t-R,t}(x, z), G^{t-R,t}(y, z) \} \Bigm| x, y \in S \Bigr\} \]
\[ \le 1 - \min\bigl\{ \min\{ G^{t-R,t}(x, x^*), G^{t-R,t}(y, x^*) \} \bigm| x, y \in S \bigr\} \le 1 - w^R \Bigl( 1 + \frac{(q-1)L}{T(t-1)} \Bigr)^{R/(1-q)}. \tag{9.92} \]

We now use the annealing schedule (9.65). According to (9.92) there exists a
positive integer $k_0$ such that the following inequality holds for any integer k
satisfying k ≥ $k_0$:


\[ 1 - \alpha(G^{kR-R,kR}) \ge w^R \Bigl( 1 + \frac{(q-1)L (kR+1)^c}{b} \Bigr)^{R/(1-q)} \ge w^R \Bigl( \frac{2(q-1)L R^c}{b} \Bigl( k + \frac{1}{R} \Bigr)^c \Bigr)^{R/(1-q)}. \tag{9.93} \]


It is then clear that the following quantity diverges when 0 < c ≤ (q − 1)/R:


\[ \sum_{k=0}^{\infty} \bigl( 1 - \alpha(G^{kR-R,kR}) \bigr) = \sum_{k=0}^{k_0 - 1} \bigl( 1 - \alpha(G^{kR-R,kR}) \bigr) + \sum_{k=k_0}^{\infty} \bigl( 1 - \alpha(G^{kR-R,kR}) \bigr). \tag{9.94} \]

This implies weak ergodicity from Theorem 9.1. □


This proof breaks down if q < 1 since the quantity in the large parentheses of
(9.64) may be negative even when Δf > 0. In numerical calculations, such cases
are treated as u = 0, i.e. no transition. Fast relaxations to the optimal solutions
are often observed for q < 1 in actual numerical investigations. It is, however,
difficult to formulate a rigorous proof for this case.
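
The numerical convention just described can be sketched as follows. This is an illustration under stated assumptions, not the author's code: the function name `acceptance_probability` is hypothetical, downhill moves (Δf ≤ 0) are assumed to be accepted with probability one, and q = 1 is handled as the exponential limit.

```python
import math


def acceptance_probability(delta_f, T, q):
    """Generalized acceptance probability for a cost change delta_f at temperature T."""
    if delta_f <= 0.0:
        return 1.0  # assumption: downhill moves are always accepted
    if abs(q - 1.0) < 1e-12:
        return math.exp(-delta_f / T)  # conventional limit q -> 1
    base = 1.0 + (q - 1.0) * delta_f / T
    if base <= 0.0:
        return 0.0  # q < 1 with a negative bracket: treated as no transition
    return base ** (1.0 / (1.0 - q))
```

For example, with q = 1/2 the bracket becomes negative as soon as Δf > 2T, so all such uphill moves are rejected outright.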

It is also hard to prove a stronger result of strong ergodicity and approach 
to the optimal distribution (distribution uniform over the optimal states) for 
the Markov chain defined in §9.6.3 for general q. Nevertheless, weak ergodicity
itself has physically sufficient significance because the asymptotic probability 
distribution does not depend on the initial condition; it is usually inconceivable 
that such an asymptotic state independent of the initial condition is not the 
optimal one or changes with time periodically. 


9.6.5 Relaxation of the cost function 

It is not easy to prove the third condition of strong ergodicity in Theorem 9.2,
(9.74), for a generalized transition probability with q ≠ 1. However, if we restrict
ourselves to the conventional transition probability $e^{-\Delta f/T}$ corresponding to
q → 1, the following theorem can be proved (Geman and Geman 1984).


Theorem 9.5. (Strong ergodicity : conventional transition probability)
If we replace (9.64) and (9.65) in §9.6.3 by

$$u(x, y; T) = \exp\{-(f(y) - f(x))/T\} \tag{9.95}$$

$$T(t) \ge \frac{RL}{\log(t+2)} \tag{9.96}$$

then this Markov chain is strongly ergodic. The probability distribution in the
limit t → ∞ converges to the optimal distribution. Here R and L are defined as
in §9.6.4.

To prove this theorem, we set q → 1 in (9.92) in the proof of Theorem 9.4. By
using the annealing schedule of (9.96), we find that (9.94) diverges, implying weak
ergodicity. It is also well known that the stationary distribution at temperature
T(t) for any given t is the Gibbs-Boltzmann distribution, so that the second
condition of Theorem 9.2 is satisfied. Some manipulations are necessary to prove
the third convergence condition (9.74), and we only point out two facts essential
for the proof: the first is that the probability of the optimal state monotonically
increases with decreasing temperature due to the explicit form of the Gibbs-
Boltzmann distribution. The second one is that the probabilities of non-optimal
states monotonically decrease with decreasing temperature at sufficiently low
temperature.

A comment is in order on the annealing schedule. As one can see in Theorem 
9.4, the constant c in the annealing schedule (9.65) is bounded by (q — 1)/R, but 
R is of the order of the system size N by the definition (9.88) of R.²³ Then, as
N increases, c decreases and the change of T(t) becomes very mild. The same 
is true for (9.96). In practice, one often controls the temperature irrespective 
of the mathematically rigorous result as in (9.65) and (9.96); for example, an 
exponential decrease of temperature is commonly adopted. If the goal is to obtain 
an approximate estimation within a given, limited time, it is natural to try fast 
annealing schedules even when there is no guarantee of asymptotic convergence. 

It is instructive to investigate in more detail which is actually faster between
the power decay of temperature (9.65) and the logarithmic law (9.96). The time
$t_1$ necessary to reach a very low temperature δ by (9.65) is, using $b/t_1^c \approx \delta$
(where $c = (q-1)/R$),

$$t_1 \approx \exp\left(\frac{k_1 N}{q-1} \log\frac{b}{\delta}\right). \tag{9.97}$$

Here we have set $R = k_1 N$. For the case of (9.96), on the other hand, from
$k_2 N/\log t_2 \approx \delta$,

$$t_2 \approx \exp\left(\frac{k_2 N}{\delta}\right). \tag{9.98}$$


Both of these are of exponential form in N, which is reasonable because we are
discussing generic optimization problems including the class NP-complete. An
improvement in the case of the generalized transition probability (9.97) is that
δ appears as log δ whereas it has 1/δ-dependence in $t_2$, which means a smaller
coefficient of N for small δ in the former case.
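
To make the comparison concrete, the exponents of (9.97) and (9.98) can be evaluated directly. This is only a sketch: the function names are hypothetical, and the constants $k_1$, $k_2$ and $b$ are set to one purely for illustration since the text does not fix them.

```python
import math


def log_t1(N, delta, q, k1=1.0, b=1.0):
    """Logarithm of the time (9.97) for the power-law schedule with c = (q-1)/R."""
    return k1 * N / (q - 1.0) * math.log(b / delta)


def log_t2(N, delta, k2=1.0):
    """Logarithm of the time (9.98) for the logarithmic schedule."""
    return k2 * N / delta


# Illustrative numbers (assumptions): N = 10 variables, target temperature 0.01, q = 2.
exponent_power_law = log_t1(10, 0.01, 2.0)
exponent_logarithmic = log_t2(10, 0.01)
```

Both exponents grow linearly in N, but the first depends on δ only through log(1/δ).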

It should also be remarked that smaller temperature does not immediately
mean a smaller value of the cost function. To understand this, note that the
acceptance probability (9.64) for q ≠ 1 is, when T = δ ≪ 1,

$$u_1(T = \delta) \approx \left(\frac{\delta}{(q-1)\Delta f}\right)^{1/(q-1)} \tag{9.99}$$

while for q = 1, according to (9.95),

$$u_2(T = \delta) \approx e^{-\Delta f/\delta}. \tag{9.100}$$


For transitions satisfying Δf/δ ≫ 1, we have $u_1(\delta) \gg u_2(\delta)$. This implies that
transitions to states with high cost function values take place more easily for
q ≠ 1 than for q = 1 if the temperature is the same. Transition probabilities for
q ≠ 1 cause state searches among wide regions in the phase space even at low
temperatures. It would follow that the equilibrium expectation value of the cost
function after a sufficiently long time at a fixed temperature is likely to be higher
in the case q ≠ 1 than in q = 1. In numerical experiments, however, it is observed
in many cases that q ≠ 1 transition probabilities lead to faster convergence to
lower values of the cost function. The reason may be that the relaxation time for
q ≠ 1 is shorter than for q = 1, which helps the system escape local minima to
relax quickly towards the real minimum.

²³ The number of steps to reach arbitrary states is at least of the order of the system size.

Fig. 9.5. Potential and barriers in one dimension


9.7 Diffusion in one dimension 


The argument in the previous section does not directly show that we can reach 
the optimal state faster by the generalized transition probability. It should also 
be kept in mind that we have no proof of convergence in the case of q < 1. In 
the present section we fill this gap by the example of diffusion in one dimension 
due to Shinomoto and Kabashima (1991). It will be shown that the generalized 
transition probability with q < 1 indeed leads to a much faster relaxation to the 
optimal state than the conventional one q = 1 (Nishimori and Inoue 1998). 


9.7.1 Diffusion and relaxation in one dimension 
Suppose that a particle is located at one of the discrete points x = ai (with i
integer and a > 0) in one dimension and is under the potential f(x) = x²/2.
There are barriers between the present and the neighbouring locations i ± 1. The
height is B to the left (i → i − 1) and B + Δ_i to the right (i → i + 1) as shown
in Fig. 9.5. Here Δ_i is the potential difference between two neighbouring points,
Δ_i = f(a(i + 1)) − f(ai) = ax + a²/2.

The probability $P_t(i)$ that the particle is located at x = ai at time t follows
the master equation using the generalized transition probability:

$$\begin{aligned}
\frac{dP_t(i)}{dt} &= \left(1 + (q-1)\frac{B}{T}\right)^{1/(1-q)} P_t(i+1)
+ \left(1 + (q-1)\frac{B + \Delta_{i-1}}{T}\right)^{1/(1-q)} P_t(i-1) \\
&\quad - \left(1 + (q-1)\frac{B + \Delta_i}{T}\right)^{1/(1-q)} P_t(i)
- \left(1 + (q-1)\frac{B}{T}\right)^{1/(1-q)} P_t(i).
\end{aligned} \tag{9.101}$$


The first term on the right hand side represents the process that the particle
at i + 1 goes over the barrier B to i, which increases the probability at i. The
second term is for i − 1 → i, the third for i → i + 1, and the fourth for i → i − 1.
It is required that the transition probability in (9.101) should be positive semi-
definite. This condition is satisfied if we restrict q to q = 1 − (2n)⁻¹ (n = 1, 2, ...)
since the power 1/(1 − q) then equals 2n, which we therefore accept here.

It is useful to take the continuum limit a → 0 to facilitate the analysis. Let
us define γ(T) and D(T) by

$$\gamma(T) = \frac{1}{T}\left(1 + (q-1)\frac{B}{T}\right)^{q/(1-q)}, \quad D(T) = \left(1 + (q-1)\frac{B}{T}\right)^{1/(1-q)} \tag{9.102}$$

and expand the transition probability of (9.101) to first order in $\Delta_i$ and $\Delta_{i-1}$ (∝ a)
to derive

$$\begin{aligned}
\frac{dP_t(i)}{dt} &= D(T)\{P_t(i+1) - 2P_t(i) + P_t(i-1)\} \\
&\quad + a\gamma(T)\left\{x P_t(i) + \frac{a}{2} P_t(i) - x P_t(i-1) + \frac{a}{2} P_t(i-1)\right\}.
\end{aligned} \tag{9.103}$$

We rescale the time step as a²t → t and take the limit a → 0, which reduces the
above equation to the form of the Fokker–Planck equation

$$\frac{\partial P}{\partial t} = \gamma(T)\frac{\partial}{\partial x}(xP) + D(T)\frac{\partial^2 P}{\partial x^2}. \tag{9.104}$$

We are now ready to study the time evolution of the expectation value of the
cost function y:

$$y(t) = \int dx\, f(x) P(x, t). \tag{9.105}$$


The goal is to find an appropriate annealing schedule T(t) to reduce y to the
optimal value 0 as quickly as possible. The time evolution equation for y can be
derived from the Fokker–Planck equation (9.104) and (9.105),

$$\frac{dy}{dt} = -2\gamma(T)y + D(T). \tag{9.106}$$


Maximization of the rate of decrease of y is achieved by minimization of the
right hand side of (9.106) as a function of T at each time (or maximization of
the absolute value of this negative quantity). We hence differentiate the right
hand side of (9.106) with respect to T using the definitions of γ(T) and D(T),
(9.102):

$$T_{\mathrm{opt}} = \frac{2yB + (1-q)B^2}{2y + B} = (1-q)B + 2qy + O(y^2). \tag{9.107}$$

Thus (9.106) is asymptotically, as y ~ 0,

$$\frac{dy}{dt} = -2B^{1/(q-1)}\left(\frac{2q}{1-q}\right)^{q/(1-q)} y^{1/(1-q)}, \tag{9.108}$$

which is solved as

$$y = B^{1/q}\left(\frac{1-q}{2q}\right)^{1/q} t^{-(1-q)/q}. \tag{9.109}$$

Substitution into (9.107) reveals the optimal annealing schedule as

$$T_{\mathrm{opt}} \approx (1-q)B + \mathrm{const} \cdot t^{-(1-q)/q}. \tag{9.110}$$

Equation (9.109) indicates that the relaxation of y is fastest when q = 1/2 (n =
1), which leads to

$$y \approx \frac{B^2}{4} t^{-1}, \quad T_{\mathrm{opt}} \approx \frac{B}{2} + \frac{B^2}{4} t^{-1}. \tag{9.111}$$
~- AE P 
The same analysis for the conventional exponential transition probability
with q → 1 in the master equation (9.101) leads to the logarithmic form of
relaxation (Shinomoto and Kabashima 1991)

$$y \approx \frac{B}{2\log t}, \quad T_{\mathrm{opt}} \approx \frac{B}{\log t}. \tag{9.112}$$

Comparison of (9.111) and (9.112) clearly shows a faster relaxation to y = 0 in
the case of q = 1/2.

A remark on the significance of temperature is in order. $T_{\mathrm{opt}}$ in (9.107)
approaches a finite value (1 − q)B in the limit t → ∞, which may seem
unsatisfactory. However, the transition probability in (9.101) vanishes at
T = (1 − q)B, not at T = 0, if q ≠ 1, a → 0, and therefore T = (1 − q)B effectively
plays the role of absolute zero temperature.
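
The asymptotic law for q = 1/2 can be checked by integrating the relaxation equation dy/dt = −2γ(T)y + D(T) with the optimal temperature (2yB + (1 − q)B²)/(2y + B) inserted at every step. The sketch below is an illustration only: the explicit Euler scheme, the step size, the initial value y = 1 and the choice B = 1 are assumptions made here, not part of the original analysis.

```python
def gamma(T, q, B):
    """gamma(T) = (1/T) (1 + (q-1) B/T)^{q/(1-q)}."""
    return (1.0 + (q - 1.0) * B / T) ** (q / (1.0 - q)) / T


def diffusion(T, q, B):
    """D(T) = (1 + (q-1) B/T)^{1/(1-q)}."""
    return (1.0 + (q - 1.0) * B / T) ** (1.0 / (1.0 - q))


def t_opt(y, q, B):
    """Temperature that minimizes -2 gamma(T) y + D(T) at a given y."""
    return (2.0 * y * B + (1.0 - q) * B * B) / (2.0 * y + B)


def relax(q=0.5, B=1.0, y0=1.0, t_end=2000.0, dt=0.01):
    """Explicit Euler integration of dy/dt = -2 gamma(T) y + D(T) at T = t_opt(y)."""
    y, t = y0, 0.0
    for _ in range(int(round(t_end / dt))):
        T = t_opt(y, q, B)
        y += dt * (-2.0 * gamma(T, q, B) * y + diffusion(T, q, B))
        t += dt
    return t, y


t_final, y_final = relax()
```

With these settings the product `y_final * t_final` should come out close to B²/4, in line with the 1/t decay found for q = 1/2.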


Bibliographical note 

The application of statistical mechanics to optimization problems started with 
simulated annealing (Kirkpatrick et al. 1983). Review articles on optimization 
problems, not just simulated annealing but including travelling salesman, graph 
partitioning, matching, and related problems, from a physics point of view are 
found in Mézard et al. (1987) and van Hemmen and Morgenstern (1987), which 
cover most materials until the mid 1980s. A more complete account of simu- 
lated annealing with emphasis on mathematical aspects is given in Aarts and 
Korst (1989). Many optimization problems can be formulated in terms of neural 
networks. Detailed accounts are found in Hertz et al. (1991) and Bishop (1995). 


APPENDIX A 
EIGENVALUES OF THE HESSIAN 


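In this appendix we derive the eigenvalues and eigenvectors of the Hessian
discussed in Chapter 3. Let us note that the dimensionality of the matrix G is equal
to the sum of the spatial dimension of $\epsilon^\alpha$ and that of $\eta^{\alpha\beta}$, n + n(n − 1)/2 =
n(n + 1)/2. We write the eigenvalue equation as

$$G\mu = \lambda\mu, \quad \mu = \begin{pmatrix} \{\epsilon^\alpha\} \\ \{\eta^{\alpha\beta}\} \end{pmatrix}. \tag{A.1}$$

The symbol $\{\epsilon^\alpha\}$ denotes a column from $\epsilon^1$ at the top to $\epsilon^n$ at the bottom and
$\{\eta^{\alpha\beta}\}$ is for $\eta^{12}$ to $\eta^{n-1,n}$.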


A.1 Eigenvalue 1 


There are three types of eigenvectors. The first one $\mu_1$ treated in the present
section has the form $\epsilon^\alpha = a$, $\eta^{\alpha\beta} = b$. The first row of G is written as

$$(A, B, \ldots, B, C, \ldots, C, D, \ldots, D), \tag{A.2}$$

so that the first row of the eigenvalue equation $G\mu_1 = \lambda_1\mu_1$ is

$$Aa + (n-1)Ba + (n-1)Cb + \tfrac{1}{2}(n-1)(n-2)Db = \lambda_1 a. \tag{A.3}$$

The lower half of the same eigenvalue equation (corresponding to $\{\eta^{\alpha\beta}\}$) is, using
the form of the corresponding row of G, (C, C, D, ..., D, P, Q, ..., Q, R, ..., R),

$$2Ca + (n-2)Da + Pb + 2(n-2)Qb + \tfrac{1}{2}(n-2)(n-3)Rb = \lambda_1 b. \tag{A.4}$$

The factor of 2 in front of the first C comes from the observation that both $G_{\alpha(\alpha\beta)}$
and $G_{\beta(\alpha\beta)}$ are C for fixed (αβ). The factor (n − 2) in front of D reflects the
number of replicas γ giving $G_{\gamma(\alpha\beta)} = D$. The 2(n − 2) in front of Q is the number
of choices of replicas satisfying $G_{(\alpha\beta)(\alpha\gamma)} = Q$, and similarly for (n − 2)(n − 3)/2.
The condition that both (A.3) and (A.4) have a solution with non-vanishing a
and b yields

$$\lambda_1 = \tfrac{1}{2}\left(X \pm \sqrt{Y^2 + Z}\right), \tag{A.5}$$

$$X = A + (n-1)B + P + 2(n-2)Q + \tfrac{1}{2}(n-2)(n-3)R \tag{A.6}$$

$$Y = A + (n-1)B - P - 2(n-2)Q - \tfrac{1}{2}(n-2)(n-3)R \tag{A.7}$$

$$Z = 2(n-1)\{2C + (n-2)D\}^2. \tag{A.8}$$

This eigenvalue reduces in the limit n → 0 to

$$\lambda_1 = \tfrac{1}{2}\left\{A - B + P - 4Q + 3R \pm \sqrt{(A - B - P + 4Q - 3R)^2 - 8(C - D)^2}\right\}. \tag{A.9}$$


A.2 Eigenvalue 2 

The next type of solution $\mu_2$ has $\epsilon^\theta = a$ (for a specific replica θ), $\epsilon^\alpha = b$
(otherwise) and $\eta^{\alpha\beta} = c$ (when α or β is equal to θ), $\eta^{\alpha\beta} = d$ (otherwise). We assume
θ = 1 without loss of generality. The first row of the matrix G has the form
(A, B, ..., B, C, ..., C, D, ..., D). Both B and C appear n − 1 times, and D
exists (n − 1)(n − 2)/2 times. Vector $\mu_2$ is written as $^t(a, b, \ldots, b, c, \ldots, c, d, \ldots, d)$,
where there are n − 1 of b and c, and (n − 1)(n − 2)/2 of d. The first row of the
eigenvalue equation $G\mu_2 = \lambda_2\mu_2$ is

$$Aa + (n-1)Bb + Cc(n-1) + \tfrac{1}{2}Dd(n-1)(n-2) = \lambda_2 a. \tag{A.10}$$

The present vector $\mu_2$ should be different from the previous $\mu_1$ and these vectors
must be orthogonal to each other. A sufficient condition for orthogonality is
that the upper halves (with dimensionality n) of $\mu_1$ and $\mu_2$ have a vanishing
inner product and similarly for the lower halves. Then, using the notation $\mu_1 =
{}^t(x, x, \ldots, x, y, y, \ldots, y)$, we find

$$a + (n-1)b = 0, \quad c + \tfrac{1}{2}(n-2)d = 0. \tag{A.11}$$

Equation (A.10) is now rewritten as

$$(A - \lambda_2 - B)a + (n-1)(C - D)c = 0. \tag{A.12}$$

We next turn to the lower half of the eigenvalue equation corresponding to
$\{\eta^{\alpha\beta}\}$. The relevant row of G is (C, C, D, ..., D, P, Q, ..., Q, R, ..., R), where
there are n − 2 of the D, 2(n − 2) of the Q, and (n − 2)(n − 3)/2 of the R. The
eigenvector $\mu_2$ has the form $^t(a, b, \ldots, b, c, \ldots, c, d, \ldots, d)$. Hence we have

$$aC + bC + (n-2)Db + Pc + (n-2)Qc + (n-2)Qd + \tfrac{1}{2}(n-2)(n-3)Rd = \lambda_2 c. \tag{A.13}$$

This relation can be written as, using (A.11),

$$\frac{n-2}{n-1}(C - D)a + \{P + (n-4)Q - (n-3)R - \lambda_2\}c = 0. \tag{A.14}$$

The condition that (A.12) and (A.14) have non-vanishing solution yields

$$\lambda_2 = \tfrac{1}{2}\left(X \pm \sqrt{Y^2 + Z}\right), \tag{A.15}$$

$$X = A - B + P + (n-4)Q - (n-3)R \tag{A.16}$$

$$Y = A - B - P - (n-4)Q + (n-3)R \tag{A.17}$$

$$Z = 4(n-2)(C - D)^2. \tag{A.18}$$

This eigenvalue becomes degenerate with $\lambda_1$ in the limit n → 0.

There are n possible choices of the special replica θ, so that we may choose n
different eigenvectors $\mu_2$. Dimensionality n corresponding to $\{\epsilon^\alpha\}$ from n(n + 1)/2
dimensions has thus been exhausted. Within this subspace, the eigenvectors $\mu_1$
and $\mu_2$ cannot all be independent as there are no more than n independent
vectors in this space. Therefore we have n independent vectors formed from $\mu_1$
and $\mu_2$. If we recall that $\lambda_1$ and $\lambda_2$ are both doubly degenerate, the eigenvectors
$\mu_1$ and $\mu_2$ are indeed seen to construct a 2n-dimensional space.


A.3 Eigenvalue 3 
The third type of eigenvector $\mu_3$ has $\epsilon^\theta = a$, $\epsilon^\nu = a$ (for two specific replicas θ, ν)
and $\epsilon^\alpha = b$ (otherwise), and $\eta^{\theta\nu} = c$, $\eta^{\theta\alpha} = \eta^{\nu\alpha} = d$ and $\eta^{\alpha\beta} = e$ otherwise. We
may assume θ = 1, ν = 2 without loss of generality.

A sufficient condition for orthogonality with $\mu_1 = {}^t(x, \ldots, x, y, \ldots, y)$ gives

$$2a + (n-2)b = 0, \quad c + 2(n-2)d + \tfrac{1}{2}(n-2)(n-3)e = 0. \tag{A.19}$$

To check a sufficient condition of orthogonality of $\mu_3$ and $\mu_2$, we write $\mu_2 =
{}^t(x, y, \ldots, y, v, \ldots, v, w, \ldots, w)$ and obtain

$$ax + ay + (n-2)by = 0, \quad cv + (n-2)dv = 0, \quad (n-2)dw + \tfrac{1}{2}(n-2)(n-3)ew = 0. \tag{A.20}$$

From this and the condition x + (n − 1)y = 0 as derived in (A.11), we obtain

$$a - b = 0, \quad c + (n-2)d = 0, \quad d + \tfrac{1}{2}(n-3)e = 0. \tag{A.21}$$

From (A.19) and (A.21), a = b = 0, c = (2 − n)d, d = (3 − n)e/2. This relation
reduces the upper half of the eigenvalue equation (corresponding to $\{\epsilon^\alpha\}$) to
the trivial form 0 = 0. The relevant row of G is (..., P, Q, ..., Q, R, ..., R) and
$\mu_3 = {}^t(0, \ldots, 0, c, d, \ldots, d, e, \ldots, e)$. Thus the eigenvalue equation is

$$Pc + 2(n-2)Qd + \tfrac{1}{2}(n-2)(n-3)Re = \lambda_3 c, \tag{A.22}$$

which can be expressed as, using (A.21),

$$\lambda_3 = P - 2Q + R. \tag{A.23}$$

The degeneracy of $\lambda_3$ may seem to be n(n − 1)/2 from the number of choices of
θ and ν. However, n vectors have already been used in relation to $\lambda_1$, $\lambda_2$ and the
actual degeneracy (the number of independent vectors) is n(n − 3)/2. Together
with the degeneracy of $\lambda_1$ and $\lambda_2$, we have n(n + 1)/2 vectors and exhausted all
the eigenvalues.
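
The three families of eigenvectors can be verified numerically at finite n by building the matrix G explicitly and checking Gμ = λμ component by component. The sketch below is only an illustration: the helper names `build_hessian` and `residual`, the choice n = 5 and the numerical values of A, B, C, D, P, Q, R are arbitrary assumptions, and replicas are labelled from 0 so that θ = 0 and ν = 1.

```python
import math
from itertools import combinations


def build_hessian(n, A, B, C, D, P, Q, R):
    """Return (labels, G): single-replica labels (a,) followed by pairs (a, b)."""
    labels = [(a,) for a in range(n)] + list(combinations(range(n), 2))

    def entry(u, v):
        if len(u) == 1 and len(v) == 1:
            return A if u == v else B
        if len(u) == 1 or len(v) == 1:
            single, pair = (u, v) if len(u) == 1 else (v, u)
            return C if single[0] in pair else D
        return (R, Q, P)[len(set(u) & set(v))]  # 0, 1 or 2 shared replicas

    return labels, [[entry(u, v) for v in labels] for u in labels]


def residual(G, mu, lam):
    """Largest component of G mu - lam mu in absolute value."""
    return max(abs(sum(g * m for g, m in zip(row, mu)) - lam * x)
               for row, x in zip(G, mu))


n, A, B, C, D, P, Q, R = 5, 2.0, 0.3, 0.5, 0.1, 1.5, 0.4, 0.2
labels, G = build_hessian(n, A, B, C, D, P, Q, R)

# Type 1: uniform components a and b, eigenvalue from (A.5)-(A.8) with the + sign.
X1 = A + (n - 1) * B + P + 2 * (n - 2) * Q + 0.5 * (n - 2) * (n - 3) * R
Y1 = A + (n - 1) * B - P - 2 * (n - 2) * Q - 0.5 * (n - 2) * (n - 3) * R
Z1 = 2 * (n - 1) * (2 * C + (n - 2) * D) ** 2
lam1 = 0.5 * (X1 + math.sqrt(Y1 ** 2 + Z1))
a1 = -(n - 1) * (C + 0.5 * (n - 2) * D) / (A + (n - 1) * B - lam1)
mu1 = [a1 if len(u) == 1 else 1.0 for u in labels]

# Type 2: replica 0 is special, eigenvalue from (A.15)-(A.18) with the + sign.
X2 = A - B + P + (n - 4) * Q - (n - 3) * R
Y2 = A - B - P - (n - 4) * Q + (n - 3) * R
Z2 = 4 * (n - 2) * (C - D) ** 2
lam2 = 0.5 * (X2 + math.sqrt(Y2 ** 2 + Z2))
c2 = 1.0
a2 = -(n - 1) * (C - D) * c2 / (A - B - lam2)
b2, d2 = -a2 / (n - 1), -2.0 * c2 / (n - 2)
mu2 = [(a2 if u == (0,) else b2) if len(u) == 1 else (c2 if 0 in u else d2)
       for u in labels]

# Type 3: replicas 0 and 1 are special, eigenvalue (A.23).
lam3 = P - 2 * Q + R
e3 = 1.0
d3 = 0.5 * (3 - n) * e3
c3 = (2 - n) * d3
mu3 = [0.0 if len(u) == 1 else (c3, d3, e3)[2 - len(set(u) & {0, 1})]
       for u in labels]

residuals = [residual(G, mu1, lam1), residual(G, mu2, lam2), residual(G, mu3, lam3)]
```

All three residuals should vanish to rounding accuracy; changing the sign in front of the square roots gives the other branch of $\lambda_1$ and $\lambda_2$ with correspondingly different component ratios.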


APPENDIX B 
PARISI EQUATION 


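We derive the free energy in the full RSB scheme for the SK model in this
appendix following Duplantier (1981). The necessary work is the evaluation of
the term $\mathrm{Tr}\, e^L$ in the free energy (2.17). We set β = J = 1 during calculations
and retrieve these afterwards by dimensionality arguments.

As one can see from the form of the matrix (3.25) in §3.2.1, the diagonal blocks
have elements $q_K$. Thus we may carry out calculations with the diagonal element
$q_{\alpha\alpha}$ kept untouched first in the sum of $q_{\alpha\beta}S^\alpha S^\beta$ and add $\beta^2 J^2 q_K/2$ ($q_K \to q(1)$)
to βf later to cancel this extra term. We therefore evaluate

$$\begin{aligned}
G &= \mathrm{Tr} \exp\left(\frac{1}{2}\sum_{\alpha,\beta=1}^{n} q_{\alpha\beta} S^\alpha S^\beta + h\sum_{\alpha} S^\alpha\right) \\
&= \exp\left(\frac{1}{2}\sum_{\alpha,\beta} q_{\alpha\beta}\frac{\partial^2}{\partial h_\alpha \partial h_\beta}\right) \prod_{\alpha}\left[2\cosh h_\alpha\right]\Bigg|_{h_\alpha = h}.
\end{aligned} \tag{B.1}$$

If all the $q_{\alpha\beta}$ are equal to q (the RS solution), the manipulation is straightforward
and yields

$$G = \exp\left(\frac{q}{2}\frac{\partial^2}{\partial h^2}\right)(2\cosh h)^n, \tag{B.2}$$

where we have used

$$\sum_{\alpha}\frac{\partial f(h_1, \ldots, h_n)}{\partial h_\alpha}\Bigg|_{h_\alpha = h} = \frac{\partial f(h, \ldots, h)}{\partial h}. \tag{B.3}$$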
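
The operator form (B.2) can be checked against a brute-force trace for a small integer n. The sketch below relies on the standard identity that the exponential of (q/2)∂²/∂h² acting on a function F(h) equals the Gaussian average of F(h + √q u) over a unit normal variable u. The function names, the trapezoidal rule and the parameter values are illustrative assumptions.

```python
import math
from itertools import product


def replica_trace(n, q, h):
    """Tr exp((q/2) (sum_a S_a)^2 + h sum_a S_a) over S_a = +1, -1 (all q_ab = q)."""
    total = 0.0
    for spins in product((-1, 1), repeat=n):
        s = sum(spins)
        total += math.exp(0.5 * q * s * s + h * s)
    return total


def gaussian_average(n, q, h, u_max=10.0, steps=4000):
    """Trapezoidal estimate of the unit-normal average of (2 cosh(h + sqrt(q) u))^n."""
    du = 2.0 * u_max / steps
    total = 0.0
    for i in range(steps + 1):
        u = -u_max + i * du
        weight = 0.5 if i in (0, steps) else 1.0
        total += (weight * math.exp(-0.5 * u * u)
                  * (2.0 * math.cosh(h + math.sqrt(q) * u)) ** n)
    return total * du / math.sqrt(2.0 * math.pi)


direct = replica_trace(3, 0.5, 0.3)
averaged = gaussian_average(3, 0.5, 0.3)
```

The two numbers should agree to high accuracy, which is the RS statement that the replica trace reduces to a single Gaussian integral.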


In the case of 2RSB, we may reach the n × n matrix $\{q_{\alpha\beta}\}$ in three steps by
increasing the matrix dimension as $m_2$, $m_1$, n:
(2-1) $(q_2 - q_1)I(m_2)$. Here $I(m_2)$ is an $m_2 \times m_2$ matrix with all elements unity.
(2-2) $(q_2 - q_1)\mathrm{Diag}_{m_1}[I(m_2)] + (q_1 - q_0)I(m_1)$. The matrix $\mathrm{Diag}_{m_1}[I(m_2)]$ has
dimensionality $m_1 \times m_1$ with all diagonal blocks equal to $I(m_2)$ and all the
other elements zero. The second term $(q_1 - q_0)I(m_1)$ specifies all elements
to $q_1 - q_0$, and the first term replaces the block-diagonal part by $q_2 - q_0$.
(2-3) $(q_2 - q_1)\mathrm{Diag}_n[\mathrm{Diag}_{m_1}[I(m_2)]] + (q_1 - q_0)\mathrm{Diag}_n[I(m_1)] + q_0 I(n)$. All the
elements are set first to $q_0$ by the third term. The second term replaces
the diagonal block of size $m_1 \times m_1$ by $q_1$. The elements of the innermost
diagonal block of size $m_2 \times m_2$ are changed to $q_2$ by the first term.

Similarly, for general K-RSB,

(K-1) $(q_K - q_{K-1})I(m_K)$ ($m_K \times m_K$ matrix),

(K-2) $(q_K - q_{K-1})\mathrm{Diag}_{K-1}[I(m_K)] + (q_{K-1} - q_{K-2})I(m_{K-1})$ ($m_{K-1} \times m_{K-1}$
matrix),

and so on. Now, suppose that we have carried out the trace operation for the
$m_K \times m_K$ matrix determined by the above procedure. If we denote the result
as $g(m_K, h)$, since the elements of the matrix in (K-1) are all $q_K - q_{K-1}$
corresponding to RS, we have from (B.2)

$$g(m_K, h) = \exp\left\{\frac{1}{2}(q_K - q_{K-1})\frac{\partial^2}{\partial h^2}\right\}(2\cosh h)^{m_K}. \tag{B.4}$$

The next step is (K-2). The matrix in (K-2) is inserted into $q_{\alpha\beta}$ of (B.1),
and the sum of terms $(q_K - q_{K-1})\mathrm{Diag}_{K-1}[I(m_K)]$ and $(q_{K-1} - q_{K-2})I(m_{K-1})$
is raised to the exponent. The former can be written by the already-obtained
$g(m_K, h)$ in (B.4) and there are $m_{K-1}/m_K$ of this type of contribution. The
latter has uniform elements and the RS-type calculation applies. One therefore
finds that $g(m_{K-1}, h)$ can be expressed as follows:

$$g(m_{K-1}, h) = \exp\left\{\frac{1}{2}(q_{K-1} - q_{K-2})\frac{\partial^2}{\partial h^2}\right\}\left[g(m_K, h)\right]^{m_{K-1}/m_K}. \tag{B.5}$$


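Repeating this procedure, we finally arrive at

$$G = g(n, h) = \exp\left\{\frac{1}{2}q(0)\frac{\partial^2}{\partial h^2}\right\}\left[g(m_1, h)\right]^{n/m_1}. \tag{B.6}$$

In the limit n → 0, the replacement $m_j - m_{j-1} = -dx$ is appropriate, and (B.5)
reduces to the differential relation

$$g(x + dx, h) = \exp\left(-\frac{1}{2}dq(x)\frac{\partial^2}{\partial h^2}\right) g(x, h)^{1 + d\log x}. \tag{B.7}$$

In (B.4) we have $m_K \to 1$, $q_K - q_{K-1} \to 0$ and this equation becomes g(1, h) =
2 cosh h. Equation (B.7) is cast into a differential equation

$$\frac{\partial g}{\partial x} = -\frac{1}{2}\frac{dq}{dx}\frac{\partial^2 g}{\partial h^2} + \frac{1}{x} g\log g, \tag{B.8}$$

which may be rewritten using the notation $f_0(x, h) = (1/x)\log g(x, h)$ as

$$\frac{\partial f_0}{\partial x} = -\frac{1}{2}\frac{dq}{dx}\left\{\frac{\partial^2 f_0}{\partial h^2} + x\left(\frac{\partial f_0}{\partial h}\right)^2\right\}. \tag{B.9}$$

By taking the limit n → 0, we find from (B.6)

$$\begin{aligned}
\frac{1}{n}\log \mathrm{Tr}\, e^{L} &= \exp\left(\frac{1}{2}q(0)\frac{\partial^2}{\partial h^2}\right)\frac{1}{x}\log g(x, h)\Bigg|_{x, h \to 0} \\
&= \exp\left(\frac{1}{2}q(0)\frac{\partial^2}{\partial h^2}\right) f_0(0, h)\Bigg|_{h \to 0} \\
&= \int Du\, f_0(0, \sqrt{q(0)}\,u).
\end{aligned} \tag{B.10}$$

We have restricted ourselves to the case h = 0. The last expression can be
confirmed, for example, by expanding $f_0(0, h)$ in powers of h. The final expression
of the free energy (2.17) is

$$\beta f = -\frac{\beta^2 J^2}{4}\left\{1 + \int_0^1 q(x)^2\, dx - 2q(1)\right\} - \int Du\, f_0(0, \sqrt{q(0)}\,u). \tag{B.11}$$

Here $f_0$ satisfies the following Parisi equation:

$$\frac{\partial f_0(x, h)}{\partial x} = -\frac{J^2}{2}\frac{dq}{dx}\left\{\frac{\partial^2 f_0}{\partial h^2} + x\left(\frac{\partial f_0}{\partial h}\right)^2\right\} \tag{B.12}$$

under the initial condition $f_0(1, h) = \log 2\cosh\beta h$. The parameters β and J have
been recovered for correct dimensionality.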


APPENDIX C
CHANNEL CODING THEOREM


In this appendix we give a brief introduction to information theory and sketch 
the arguments leading to Shannon's channel coding theorem used in Chapter 5. 


C.1 Information, uncertainty, and entropy 


Suppose that an information source U generates a sequence of symbols (or
alphabets) from the set $\{a_1, a_2, \ldots, a_L\}$ with probabilities $p_1, p_2, \ldots, p_L$, respectively.
A single symbol $a_i$ is assumed to be generated one at a time according to this
independently, identically distributed probability. The resulting sequence has the
form of, for example, $a_2 a_5 a_i a_1 \ldots$.

The entropy of this information source is defined by

$$H(U) = -\sum_{i=1}^{L} p_i \log_2 p_i \quad [\text{bit/symbol}]. \tag{C.1}$$


This quantity is a measure of uncertainty about the outcome from the source.
For example, if all symbols are generated with equal probability ($p_1 = \cdots =
p_L = 1/L$), the entropy assumes the maximum possible value $H = \log_2 L$ as can
be verified by extremizing H(U) under the normalization condition $\sum_i p_i = 1$
using the Lagrange multiplier. This result means that the amount of information
obtained after observation of the actual outcome (a1, for instance) is largest in 
the uniform case. Thus the entropy may also be regarded as the amount of 
information obtained by observing the actual outcome. The other extreme is 
the case where one of the symbols is generated with probability 1 and all other 
symbols with probability 0, resulting in H = 0. This vanishing value is also 
natural since no information is gained by observation of the actual outcome 
because we know the result from the outset (no uncertainty). The entropy takes 
intermediate values for other cases with partial uncertainties. 

The unit of the entropy is chosen to be [bit/symbol], and correspondingly the
base of the logarithm is two. This choice is easy to understand if one considers
the case of L = 2 and $p_1 = p_2 = 1/2$. The entropy is then $H = \log_2 2 = 1$, which
implies that one gains one bit of information by observing the actual outcome
of a perfectly randomly generated binary symbol.

A frequently used example is the binary entropy $H_2(p)$. A symbol (e.g. 0) is
generated with probability p and another symbol (e.g. 1) with 1 − p. Then the
entropy is

$$H_2(p) = -p\log_2 p - (1-p)\log_2(1-p). \tag{C.2}$$


The binary entropy $H_2(p)$ is concave (convex upward), reaches its maximum
$H_2 = 1$ at p = 1/2, and is symmetric about p = 1/2.
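
The definitions of the entropy and the binary entropy translate directly into code. The following is a minimal sketch; the function names are hypothetical, and terms with zero probability are skipped under the usual convention 0 log 0 = 0.

```python
import math


def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def binary_entropy(p):
    """Binary entropy H_2(p) in bits."""
    return entropy([p, 1.0 - p])
```

For instance, `entropy([0.25] * 4)` gives two bits, the maximum for four symbols, while a deterministic source gives zero.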


C.2 Channel capacity 


To discuss the properties of a transmission channel, it is convenient to introduce
a few quantities related to entropy. The first one is the conditional entropy
H(X|Y) that is a measure of uncertainty about the set of events X given another
event y ∈ Y. For a given conditional probability P(x|y), the following quantity
measures the uncertainty about X, given y:

$$H(X|Y = y) = -\sum_{x} P(x|y)\log_2 P(x|y). \tag{C.3}$$

The conditional entropy is defined as the average of H(X|y) over the distribution
of y:

$$\begin{aligned}
H(X|Y) &= \sum_{y} P(y) H(X|y) \\
&= -\sum_{y} P(y)\sum_{x} P(x|y)\log_2 P(x|y) \\
&= -\sum_{x}\sum_{y} P(x, y)\log_2 P(x|y),
\end{aligned} \tag{C.4}$$

where we have used P(x, y) = P(x|y)P(y). Similarly,

$$H(Y|X) = -\sum_{x}\sum_{y} P(x, y)\log_2 P(y|x). \tag{C.5}$$

One sometimes uses the joint entropy for two sets of events X and Y although
it does not appear in the analyses of the present book:

$$H(X, Y) = -\sum_{x}\sum_{y} P(x, y)\log_2 P(x, y). \tag{C.6}$$

It is straightforward to verify the following relations

$$H(X, Y) = H(Y) + H(X|Y) = H(X) + H(Y|X) \tag{C.7}$$

from the identity P(x, y) = P(x|y)P(y) = P(y|x)P(x).
The mutual information is defined by

$$I(X, Y) = H(X) - H(X|Y). \tag{C.8}$$

The meaning of this expression is understood relatively easily in the situation of
a noisy transmission channel. Suppose that X is the set of inputs to the channel
and Y is for the output. Then H(X) represents uncertainty about the input
without any observation of the output, whereas H(X|Y) corresponds to uncertainty
about the input after observation of the output. Thus their difference I(X, Y) is
the change of uncertainty by learning the channel output, which may be
interpreted as the amount of information carried by the channel. Stated otherwise, we
arrive at a decreased value of uncertainty H(X|Y) by utilizing the information
I(X, Y) carried by the channel (see Fig. C.1). The mutual information is also
written as

$$I(X, Y) = H(Y) - H(Y|X). \tag{C.9}$$

Fig. C.1. Entropy H(X), conditional entropy H(X|Y), and mutual information
I(X, Y)
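
The two expressions for the mutual information, and the chain rule for the joint entropy, can be checked on any small joint distribution. The sketch below is illustrative: the function name and the 2 × 2 joint distribution `joint_example` are assumptions chosen only for the demonstration.

```python
import math


def entropy_terms(joint):
    """joint maps (x, y) -> P(x, y); returns entropies in bits as a dict."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p

    def h(dist):
        return -sum(p * math.log2(p) for p in dist.values() if p > 0.0)

    h_x_given_y = -sum(p * math.log2(p / py[y])
                       for (x, y), p in joint.items() if p > 0.0)
    h_y_given_x = -sum(p * math.log2(p / px[x])
                       for (x, y), p in joint.items() if p > 0.0)
    return {"H(X)": h(px), "H(Y)": h(py), "H(X,Y)": h(joint),
            "H(X|Y)": h_x_given_y, "H(Y|X)": h_y_given_x}


# Hypothetical joint distribution of a binary input x and a binary output y.
joint_example = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
terms = entropy_terms(joint_example)
info_from_x = terms["H(X)"] - terms["H(X|Y)"]
info_from_y = terms["H(Y)"] - terms["H(Y|X)"]
```

Both differences give the same non-negative number, the information carried by the channel for this particular input distribution.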


The channel capacity C is defined as the maximum possible value of the
mutual information as a function of the input probability distribution:

$$C = \max_{\{\text{input prob}\}} I(X, Y). \tag{C.10}$$

The channel capacity represents the maximum possible amount of information
carried by the channel with a given noise probability.

The concepts of entropy and information can be applied to continuous
distributions as well. For a probability distribution density P(x) of a continuous
stochastic variable X, the entropy is defined by

$$H(X) = -\int P(x)\log_2 P(x)\, dx. \tag{C.11}$$

The conditional entropy is

$$H(Y|X) = -\int\!\!\int P(x, y)\log_2 P(y|x)\, dx\, dy, \tag{C.12}$$

and the mutual information is given as

$$I(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X). \tag{C.13}$$

The channel capacity is the maximum value of mutual information with respect
to the input probability distribution function

$$C = \max_{\{\text{input prob}\}} I(X, Y). \tag{C.14}$$




C.3 BSC and Gaussian channel 
Let us calculate the channel capacities of the BSC and Gaussian channel using 
the formulation developed in the previous section. 

We first consider the BSC. Suppose that the input symbol of the channel is
either 0 or 1 with probabilities r and 1 − r, respectively:

$$P(x = 0) = r, \quad P(x = 1) = 1 - r. \tag{C.15}$$

The channel has a binary symmetric noise:

$$\begin{aligned}
P(y = 0|x = 0) &= P(y = 1|x = 1) = 1 - p \\
P(y = 1|x = 0) &= P(y = 0|x = 1) = p.
\end{aligned} \tag{C.16}$$

Then the probability of the output is easily calculated as

$$P(y = 0) = r(1-p) + (1-r)p = r + p - 2rp, \quad P(y = 1) = 1 - P(y = 0). \tag{C.17}$$

The relevant entropies are

$$\begin{aligned}
H(Y) &= -(r + p - 2rp)\log_2(r + p - 2rp) \\
&\quad - (1 - r - p + 2rp)\log_2(1 - r - p + 2rp) \\
H(Y|X) &= -p\log_2 p - (1-p)\log_2(1-p) = H_2(p) \\
I(X, Y) &= H(Y) - H(Y|X).
\end{aligned} \tag{C.18}$$

The channel capacity is the maximum of I(X, Y) with respect to r. This is
achieved when r = 1/2 (perfectly random input):

$$C = \max_{r} I(X, Y) = 1 + p\log_2 p + (1-p)\log_2(1-p) = 1 - H_2(p). \tag{C.19}$$
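
The maximization over r can be reproduced by a simple scan of the mutual information of the BSC. This is a sketch under stated assumptions: the function names, the grid resolution and the flip probability p = 0.1 are illustrative choices.

```python
import math


def h2(p):
    """Binary entropy in bits, with h2(0) = h2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def bsc_mutual_information(r, p):
    """I(X,Y) = H(Y) - H(Y|X) for input probability r and flip probability p."""
    return h2(r + p - 2.0 * r * p) - h2(p)


def bsc_capacity_by_scan(p, grid=1000):
    """Return (max mutual information, maximizing r) over a uniform grid in r."""
    return max((bsc_mutual_information(i / grid, p), i / grid)
               for i in range(grid + 1))


best_info, best_r = bsc_capacity_by_scan(0.1)
```

The scan should locate the maximum at r = 1/2, where the mutual information equals 1 − H₂(p).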


Let us next investigate the capacity of the Gaussian channel. Suppose that
the input sequence is generated according to a probability distribution P(x).
The typical strength (power) of an input signal will be denoted by $J_0^2$:

$$\int P(x)\, x^2\, dx = J_0^2. \tag{C.20}$$

The output Y of the Gaussian channel with noise power $J^2$ is described by the
probability density

$$P(y|x) = \frac{1}{\sqrt{2\pi}J}\exp\left\{-\frac{(y-x)^2}{2J^2}\right\}. \tag{C.21}$$

To evaluate the mutual information using the second expression of (C.13), we
express the entropy of the output using

$$P(y) = \int P(y|x)P(x)\, dx = \frac{1}{\sqrt{2\pi}J}\int \exp\left\{-\frac{(y-x)^2}{2J^2}\right\} P(x)\, dx \tag{C.22}$$

as

$$H(Y) = -\int P(y)\log_2 P(y)\, dy. \tag{C.23}$$

The conditional entropy is derived from (C.12) and P(x, y) = P(y|x)P(x) as

$$H(Y|X) = \log_2(\sqrt{2\pi}J) + \frac{\log_2 e}{2}. \tag{C.24}$$

Thus the mutual information is

$$I(X, Y) = -\int P(y)\log_2 P(y)\, dy - \log_2(\sqrt{2\pi}J) - \frac{\log_2 e}{2}. \tag{C.25}$$

To evaluate the channel capacity, this mutual information should be maximized
with respect to the input probability P(x), which is equivalent to maximization
with respect to P(y) according to (C.22). The distribution P(y) satisfies two
constraints, which are to be taken into account in maximization:

$$\int P(y)\, dy = 1, \quad \int y^2 P(y)\, dy = J^2 + J_0^2 \tag{C.26}$$

as can be verified from (C.22) and (C.20). By using Lagrange multipliers to
reflect the constraints (C.26), the extremization condition

$$\frac{\delta}{\delta P(y)}\left\{-\int P(y)\log_2 P(y)\, dy - \lambda_1\left(\int P(y)\, dy - 1\right) - \lambda_2\left(\int y^2 P(y)\, dy - J^2 - J_0^2\right)\right\} = 0 \tag{C.27}$$

reads

$$-\log_2 P(y) - \lambda_2 y^2 - \mathrm{const} = 0. \tag{C.28}$$

The solution is

$$P(y) = \frac{1}{\sqrt{2\pi(J^2 + J_0^2)}}\exp\left\{-\frac{y^2}{2(J^2 + J_0^2)}\right\}, \tag{C.29}$$

where the constants in (C.28) have been fixed so that the result satisfies (C.26).
Insertion of this formula into (C.25) immediately yields the capacity as

$$C = \frac{1}{2}\log_2\left(1 + \frac{J_0^2}{J^2}\right). \tag{C.30}$$
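
The last step can be checked numerically: inserting a Gaussian output distribution of variance J² + J₀² into the mutual information −∫P(y) log₂P(y) dy − log₂(√(2π)J) − (log₂e)/2 should reproduce the closed form of the capacity. The sketch below uses a plain trapezoidal rule; the function names, the integration range and the values J₀ = 1, J = 0.5 are illustrative assumptions.

```python
import math


def gaussian_entropy_bits(variance, range_in_sigmas=12.0, steps=20000):
    """Trapezoidal estimate of -int P(y) log2 P(y) dy for a zero-mean Gaussian."""
    y_max = range_in_sigmas * math.sqrt(variance)
    dy = 2.0 * y_max / steps
    total = 0.0
    for i in range(steps + 1):
        y = -y_max + i * dy
        p = math.exp(-y * y / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)
        weight = 0.5 if i in (0, steps) else 1.0
        total -= weight * p * math.log2(p) * dy
    return total


def mutual_information_gaussian_output(J0, J):
    """H(Y) - H(Y|X) with the Gaussian output distribution of variance J^2 + J0^2."""
    h_y = gaussian_entropy_bits(J * J + J0 * J0)
    h_y_given_x = math.log2(math.sqrt(2.0 * math.pi) * J) + 0.5 * math.log2(math.e)
    return h_y - h_y_given_x


def gaussian_capacity(J0, J):
    """Closed form (1/2) log2(1 + J0^2 / J^2)."""
    return 0.5 * math.log2(1.0 + (J0 * J0) / (J * J))


numeric_capacity = mutual_information_gaussian_output(1.0, 0.5)
closed_form_capacity = gaussian_capacity(1.0, 0.5)
```

The two values should agree to many digits, since the integrand is smooth and negligible at the ends of the integration range.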


C.4 Typical sequence and random coding 

We continue to discuss the properties of a sequence of symbols with the input
and output of a noisy channel in mind. Let us consider a sequence of symbols of
length M in which the symbol $a_i$ appears $m_i$ times ($i = 1, 2, \ldots, L$; $M = \sum_i m_i$).
If M is very large and symbols are generated one by one independently, $m_i$ is
approximately equal to $Mp_i$, where $p_i$ is the probability that $a_i$ appears in a
single event. More precisely, according to the weak law of large numbers, the
inequality

$$\Pr\left[\left|\frac{m_i}{M} - p_i\right| > \epsilon\right] < \epsilon \tag{C.31}$$

holds for any positive ε if one takes sufficiently large M.
Then the probability $p_{\mathrm{typ}}$ that $a_i$ appears $m_i$ times (i = 1, ..., L) in the
sequence is

$$\begin{aligned}
p_{\mathrm{typ}} &= p_1^{m_1}\cdots p_L^{m_L} \\
&\approx p_1^{Mp_1}\cdots p_L^{Mp_L} \\
&= 2^{M(p_1\log_2 p_1 + \cdots + p_L\log_2 p_L)} \\
&= 2^{-MH(U)}.
\end{aligned} \tag{C.32}$$

A sequence with $a_i$ appearing $Mp_i$ times (i = 1, ..., L) is called the typical
sequence. All typical sequences appear with the same probability $2^{-MH(U)}$.

The number of typical sequences is the number of ways to distribute $m_i$ of
the $a_i$ among the M symbols (i = 1, ..., L):

$$N_{\mathrm{typ}} = \frac{M!}{m_1!\cdots m_L!}. \tag{C.33}$$

For sufficiently large M and $m_1, m_2, \ldots, m_L$, we find from the Stirling formula
and $m_i = Mp_i$,

$$\begin{aligned}
\log_2 N_{\mathrm{typ}} &= M(\log_2 M - 1) - \sum_{i} m_i(\log_2 m_i - 1) \\
&= -M\sum_{i} p_i\log_2 p_i \\
&= MH(U).
\end{aligned} \tag{C.34}$$

Thus $N_{\mathrm{typ}}$ is the inverse of $p_{\mathrm{typ}}$,

$$N_{\mathrm{typ}} = 2^{MH(U)} = (p_{\mathrm{typ}})^{-1}. \tag{C.35}$$


This result is quite natural as all sequences in the set of typical sequences ap- 
pear with the same probability. Equation (C.35) also confirms that H(U) is the 
uncertainty about the outcome from U. 
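
The Stirling estimate of the number of typical sequences can be compared with the exact multinomial coefficient, evaluated through the log-gamma function to avoid huge integers. The sketch below is illustrative: the function names, the length M = 10000 and the three-symbol distribution are assumptions.

```python
import math


def log2_typical_count(M, probs):
    """log2 of M! / (m_1! ... m_L!) with m_i = M p_i (required to be integers)."""
    counts = [round(M * p) for p in probs]
    if sum(counts) != M:
        raise ValueError("M * p_i must be integers summing to M")
    log_n = math.lgamma(M + 1) - sum(math.lgamma(m + 1) for m in counts)
    return log_n / math.log(2.0)


def entropy_bits(probs):
    """Entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


M_example = 10000
probs_example = [0.5, 0.3, 0.2]
rate_exact = log2_typical_count(M_example, probs_example) / M_example
rate_entropy = entropy_bits(probs_example)
```

The exact rate lies slightly below the entropy, with a difference that shrinks roughly like (log M)/M as the sequence grows.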

We restrict ourselves to binary symbols (0 or 1, for example) for simplicity
from now on (i.e. L = 2). The set of inputs to the channel is denoted by X and
that of outputs by Y. Both are composed of sequences of length M. The original
source message has the length N. Random coding is a method of channel coding
in which one randomly chooses code words from typical sequences in X. More
precisely, a source message has the length N and the total number of messages
is $2^N$. We assign a code word for each of the source messages by randomly
choosing a typical sequence of length M (> N) from the set X. Note that there
are $2^{MH(X)}$ typical sequences in X, and only $2^N$ of them are chosen as code
words. The code rate is R = N/M. This random coding enables us to decode
the message without errors if the code rate is smaller than the channel capacity
R < C as shown below.

Fig. C.2. The original message m has a one-to-one correspondence with a code
word x. There are $2^{MH(X|Y)}$ possible inputs corresponding to an output of
the channel. Only one of these $2^{MH(X|Y)}$ code words (marked as a dot) should
be the code word assigned to an original message.


C.5 Channel coding theorem 


Our goal is to show that the probability of correct decoding can be made ar- 
bitrarily close to one in the limit of infinite length of code word. Such an ideal 
decoding is possible if only a single code word x (€ X) corresponds to a given 
output of the channel y (€ Y). However, we know that there are 2!@#(41¥) pos- 
sibilities as the input corresponding to a given output. The only way out is that 
none of these 2M Ħ(XIY) sequences are code words in our random coding except 
a single one, the correct input (see Fig. C.2). To estimate the probability of such 
a case, we first note that the probability that a typical sequence of length M is 
chosen as a code word for an original message of length N is 


$$ \frac{2^N}{2^{MH(X)}} = 2^{-M[H(X)-R]}, \tag{C.36} $$

because $2^N$ sequences are chosen from $2^{MH(X)}$. Thus the probability that an
arbitrary typical sequence of length M is not a code word is

$$ 1 - 2^{-M[H(X)-R]}. \tag{C.37} $$


Now, we require that the $2^{MH(X|Y)}$ sequences of length M (corresponding to
a given output of the channel) are not code words except the single correct one.
Such a probability is clearly

$$ p_{\text{correct}} = \left[ 1 - 2^{-M[H(X)-R]} \right]^{2^{MH(X|Y)}-1} \approx 1 - 2^{-M[H(X)-R-H(X|Y)]}. \tag{C.38} $$
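
The approximation in (C.38) can be spelled out as follows. This is a sketch of the intermediate step rather than part of the original argument; the shorthand $\epsilon$ and $\mathcal{N}$ is introduced here only for illustration.

```latex
% Shorthand (assumed here, not in the original text):
%   \epsilon    = 2^{-M[H(X)-R]}  probability (C.36) that a typical sequence is a code word
%   \mathcal{N} = 2^{MH(X|Y)}     number of inputs compatible with a given output
\begin{align*}
p_{\text{correct}}
  &= (1-\epsilon)^{\mathcal{N}-1}
   = \exp\bigl[(\mathcal{N}-1)\log(1-\epsilon)\bigr] \\
  &\approx \exp(-\mathcal{N}\epsilon)
   \approx 1 - \mathcal{N}\epsilon
   = 1 - 2^{-M[H(X)-R-H(X|Y)]}.
\end{align*}
% Both steps require \epsilon \ll 1 and \mathcal{N}\epsilon \ll 1,
% which holds for large M when R < H(X) - H(X|Y).
```
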


The maximum possible value of $p_{\text{correct}}$ is given by replacing H(X) - H(X|Y)
by its largest value C, the channel capacity:

$$ \max\{p_{\text{correct}}\} \approx 1 - 2^{-M(C-R)}, \tag{C.39} $$

which tends to one as M → ∞ if R < C. This completes the argument for the
channel coding theorem.
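
The behaviour of (C.38) can be checked numerically. The sketch below assumes, purely for illustration, a binary symmetric channel with flip probability `flip` and uniform inputs, for which H(X) = 1 and H(X|Y) equals the binary entropy of `flip`; the function names are hypothetical. The exact form is evaluated in the log domain to avoid loss of precision.

```python
import math

def binary_entropy(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def p_correct_exact(M, R, HX, HXgY):
    """First expression in (C.38): [1 - 2^{-M(H(X)-R)}]^(2^{M H(X|Y)} - 1)."""
    eps = 2.0 ** (-M * (HX - R))      # probability (C.36)
    count = 2.0 ** (M * HXgY) - 1.0   # wrong candidates; overflows for very large M
    return math.exp(count * math.log1p(-eps))

def p_correct_approx(M, R, HX, HXgY):
    """Second expression in (C.38); meaningful only when R < H(X) - H(X|Y)."""
    return 1.0 - 2.0 ** (-M * (HX - R - HXgY))

# Assumed example channel: binary symmetric, uniform input.
flip = 0.1
HX = 1.0
HXgY = binary_entropy(flip)
C = HX - HXgY  # capacity of this channel, since uniform input maximizes H(X) - H(X|Y)

if __name__ == "__main__":
    print("capacity C =", C)
    for R in (0.25, 0.75):
        for M in (50, 100, 200):
            print(R, M, p_correct_exact(M, R, HX, HXgY))
```

For a rate below C the exact expression approaches one as M grows and agrees with the approximation; for a rate above C the exact expression collapses towards zero, while the approximate form is no longer valid.
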


APPENDIX D 
DISTRIBUTION AND FREE ENERGY OF K-SAT 


In this appendix we derive the self-consistent equation (9.53) and the equilibrium 
free energy (9.55) of K-SAT from the variational free energy (9.51) under the 
RS ansatz (9.52). 

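The function $c(\sigma)$ depends on $\sigma$ only through the number of down spins
j in the set $\sigma = (\sigma^1, \dots, \sigma^n)$ if we assume symmetry between replicas; we thus
sometimes use the notation c(j) for $c(\sigma)$. The free energy (9.51) is then expressed
as

$$ -\frac{\beta F}{N} = -\sum_{j=0}^{n} \binom{n}{j} c(j) \log c(j) + \alpha \log \Biggl\{ \sum_{j_1=0}^{n} \cdots \sum_{j_K=0}^{n} c(j_1) \cdots c(j_K) \sum_{\sigma_1(j_1)} \cdots \sum_{\sigma_K(j_K)} \prod_{\alpha=1}^{n} \Bigl( 1 + (e^{-\beta}-1) \prod_{k=1}^{K} \delta(\sigma_k^{\alpha}, 1) \Bigr) \Biggr\}, \tag{D.1} $$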

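where the sum over $\sigma_i(j_i)$ is for the $\sigma_i$ with $j_i$ down spins. Variation of (D.1)
with respect to c(j) yields

$$ \frac{\delta}{\delta c(j)} \left( -\frac{\beta F}{N} \right) = -\binom{n}{j} \bigl( \log c(j) + 1 \bigr) + \frac{K\alpha g}{f}, \tag{D.2} $$

where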


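$$ f = \sum_{j_1=0}^{n} \cdots \sum_{j_K=0}^{n} c(j_1) \cdots c(j_K) \sum_{\sigma_1(j_1)} \cdots \sum_{\sigma_K(j_K)} \prod_{\alpha=1}^{n} \Bigl( 1 + (e^{-\beta}-1) \prod_{k=1}^{K} \delta(\sigma_k^{\alpha}, 1) \Bigr), \tag{D.3} $$

$$ g = \sum_{j_1=0}^{n} \cdots \sum_{j_{K-1}=0}^{n} c(j_1) \cdots c(j_{K-1}) \sum_{\sigma_1(j_1)} \cdots \sum_{\sigma_{K-1}(j_{K-1})} \sum_{\sigma(j)} \prod_{\alpha=1}^{n} \Bigl( 1 + (e^{-\beta}-1)\, \delta(\sigma^{\alpha}, 1) \prod_{k=1}^{K-1} \delta(\sigma_k^{\alpha}, 1) \Bigr). \tag{D.4} $$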

These functions f and g are expressed in terms of the local magnetization density
P(m) defined by

$$ c(\sigma) = \int_{-1}^{1} dm\, P(m) \prod_{\alpha=1}^{n} \frac{1 + m\sigma^{\alpha}}{2} \tag{D.5} $$

as

$$ f = \int_{-1}^{1} \prod_{k=1}^{K} dm_k\, P(m_k)\, (A_K)^{n}, \tag{D.6} $$

$$ g = \binom{n}{j} \int_{-1}^{1} \prod_{k=1}^{K-1} dm_k\, P(m_k)\, (A_{K-1})^{n-j}, \tag{D.7} $$

where

$$ A_K = 1 + (e^{-\beta}-1) \prod_{k=1}^{K} \frac{1+m_k}{2}. \tag{D.8} $$


Equations (D.6) and (D.7) are derived as follows. 
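Recalling that the sum over $\sigma(j)$ in g appearing in (D.4) is for the $\sigma$ with j
down spins (for which $\delta(\sigma^{\alpha}, 1) = 0$), we find

$$ \begin{aligned} g &= \sum_{j_1=0}^{n} \cdots \sum_{j_{K-1}=0}^{n} c(j_1) \cdots c(j_{K-1}) \sum_{\sigma_1(j_1)} \cdots \sum_{\sigma_{K-1}(j_{K-1})} \sum_{\sigma(j)} {\prod_{\alpha=1}^{n}}' \Bigl( 1 + (e^{-\beta}-1) \prod_{k=1}^{K-1} \delta(\sigma_k^{\alpha}, 1) \Bigr) \\ &= \sum_{\sigma(j)} \sum_{\sigma_1} \cdots \sum_{\sigma_{K-1}} c(\sigma_1) \cdots c(\sigma_{K-1}) {\prod_{\alpha=1}^{n}}' \Bigl( 1 + (e^{-\beta}-1) \prod_{k=1}^{K-1} \delta(\sigma_k^{\alpha}, 1) \Bigr), \end{aligned} \tag{D.9} $$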


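where the product is over the replicas with $\sigma^{\alpha} = 1$. If we insert (D.5) into this
equation and carry out the sums over $\sigma_1$ to $\sigma_{K-1}$, we find

$$ \begin{aligned} g &= \sum_{\sigma(j)} \int_{-1}^{1} \prod_{k=1}^{K-1} dm_k\, P(m_k)\, (A_{K-1})^{n-j} \\ &= \binom{n}{j} \int_{-1}^{1} \prod_{k=1}^{K-1} dm_k\, P(m_k)\, (A_{K-1})^{n-j}, \end{aligned} \tag{D.10} $$
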


proving (D.7). Similar manipulations lead to (D.6). 
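
The identities (D.6) and (D.7) can be checked by brute force for small integer n. The sketch below is an illustration only: it assumes a discrete P(m), given as a dictionary mapping each magnetization value to its weight, so that the integrals become finite sums, and it uses hypothetical function names. It compares the explicit spin sums (D.3) and (D.4) with the closed forms.

```python
import itertools
import math

def c_sigma(sigma, P):
    """c(sigma) of (D.5) for a discrete P given as {m: weight}."""
    return sum(w * math.prod((1 + m * s) / 2 for s in sigma) for m, w in P.items())

def A(ms, beta):
    """A_K of (D.8) for magnetizations ms = (m_1, ..., m_K)."""
    return 1 + (math.exp(-beta) - 1) * math.prod((1 + m) / 2 for m in ms)

def f_bruteforce(n, K, beta, P):
    """f of (D.3): explicit sum over all configurations of K replicated spins."""
    configs = list(itertools.product((1, -1), repeat=n))
    total = 0.0
    for sig in itertools.product(configs, repeat=K):
        weight = math.prod(c_sigma(s, P) for s in sig)
        factor = math.prod(
            1 + (math.exp(-beta) - 1) * all(s[a] == 1 for s in sig)
            for a in range(n))
        total += weight * factor
    return total

def f_closed(n, K, beta, P):
    """f of (D.6), with integrals replaced by sums over the support of P."""
    return sum(math.prod(P[m] for m in ms) * A(ms, beta) ** n
               for ms in itertools.product(P, repeat=K))

def g_bruteforce(n, K, j, beta, P):
    """g of (D.4): the extra spin sigma is restricted to j down spins."""
    configs = list(itertools.product((1, -1), repeat=n))
    total = 0.0
    for sigma in (s for s in configs if s.count(-1) == j):
        for sig in itertools.product(configs, repeat=K - 1):
            weight = math.prod(c_sigma(s, P) for s in sig)
            factor = math.prod(
                1 + (math.exp(-beta) - 1)
                * (sigma[a] == 1 and all(s[a] == 1 for s in sig))
                for a in range(n))
            total += weight * factor
    return total

def g_closed(n, K, j, beta, P):
    """g of (D.7), with integrals replaced by sums over the support of P."""
    return math.comb(n, j) * sum(
        math.prod(P[m] for m in ms) * A(ms, beta) ** (n - j)
        for ms in itertools.product(P, repeat=K - 1))

if __name__ == "__main__":
    P = {-0.6: 0.3, 0.0: 0.4, 0.6: 0.3}  # assumed symmetric example distribution
    print(f_bruteforce(2, 3, 1.0, P), f_closed(2, 3, 1.0, P))
    print(g_bruteforce(3, 3, 1, 1.0, P), g_closed(3, 3, 1, 1.0, P))
```

The check covers positive integer n only; the limit n → 0 used later in the appendix relies on analytic continuation of the closed forms.
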
In the extremization of F with respect to c(j), we should take into account the
symmetry $c(j) = c(n-j)$ coming from $c(\sigma) = c(-\sigma)$ as well as the normalization
condition $\sum_{j=0}^{n} \binom{n}{j} c(j) = 1$. Using a Lagrange multiplier for the latter and from
(D.2), the extremization condition is

$$ -2 \bigl( \log c(j) + 1 \bigr) + \frac{K\alpha}{f} \int_{-1}^{1} \prod_{k=1}^{K-1} dm_k\, P(m_k) \bigl\{ (A_{K-1})^{n-j} + (A_{K-1})^{j} \bigr\} - 2\lambda = 0, \tag{D.11} $$

from which we find

$$ c(j) = \exp\left[ -\lambda - 1 + \frac{K\alpha}{2f} \int_{-1}^{1} \prod_{k=1}^{K-1} dm_k\, P(m_k) \bigl( (A_{K-1})^{n-j} + (A_{K-1})^{j} \bigr) \right]. \tag{D.12} $$


The number of replicas n has so far been arbitrary. Letting n → 0, we obtain
the self-consistent equation for P(m). The value of the Lagrange multiplier $\lambda$ in
the limit n → 0 is evaluated from (D.12) for j = 0 using c(0) = 1. The result is
$\lambda = K\alpha - 1$, which is to be used in (D.12) to eliminate $\lambda$. The distribution P(m) is
now derived from the inverse relation of


$$ c(j) = \int_{-1}^{1} dm\, P(m) \left( \frac{1+m}{2} \right)^{n-j} \left( \frac{1-m}{2} \right)^{j} \tag{D.13} $$

in the limit n → 0; that is,


$$ P(m) = \frac{1}{\pi(1-m^2)} \int_{-\infty}^{\infty} dy\, c(iy) \exp\left( -iy \log \frac{1-m}{1+m} \right). \tag{D.14} $$

Inserting (D.12) (in the limit n → 0) with $\lambda$ replaced by $K\alpha - 1$ into the right
hand side of the above equation, we finally arrive at the desired relation (9.53)
for P(m).

It is necessary to consider the O(n) terms to derive the free energy (9.55) 
expressed in terms of P(m). Let us start from (D.1): 


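$$ -\frac{\beta F}{N} = -\sum_{j=0}^{n} \binom{n}{j} c(j) \log c(j) + \alpha \log f. \tag{D.15} $$

The expression (D.6) for f implies that f is expanded in n as

$$ f = 1 + na + O(n^2), \tag{D.16} $$

$$ a = \int_{-1}^{1} \prod_{k=1}^{K} dm_k\, P(m_k) \log A_K. \tag{D.17} $$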


The first term on the right hand side of (D.15) is, using (D.12), 


$$ -\sum_{j=0}^{n} \binom{n}{j} c(j) \log c(j) = \lambda + 1 - \frac{K\alpha}{2f} \sum_{j=0}^{n} \binom{n}{j} c(j) \int_{-1}^{1} \prod_{k=1}^{K-1} dm_k\, P(m_k) \bigl( (A_{K-1})^{n-j} + (A_{K-1})^{j} \bigr). \tag{D.18} $$


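We should therefore expand $\lambda$ to O(n). For this purpose, we equate (D.12) and
(D.13) to get

$$ e^{\lambda+1} = \frac{ \exp\left\{ \dfrac{K\alpha}{2f} \displaystyle\int_{-1}^{1} \prod_{k=1}^{K-1} dm_k\, P(m_k) \bigl( (A_{K-1})^{n-j} + (A_{K-1})^{j} \bigr) \right\} }{ \displaystyle\int_{-1}^{1} dm\, P(m) \left( \frac{1+m}{2} \right)^{n-j} \left( \frac{1-m}{2} \right)^{j} }. \tag{D.19} $$

Since the left hand side is independent of j, we may set j = 0 on the right hand
side. We then expand the right hand side to O(n) to obtain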


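$$ \lambda + 1 = K\alpha + n \left\{ K\alpha \left( -a + \frac{b}{2} \right) + \log 2 - \frac{1}{2} \int_{-1}^{1} dm\, P(m) \log(1-m^2) \right\} + O(n^2), \tag{D.20} $$

where

$$ b = \int_{-1}^{1} \prod_{k=1}^{K-1} dm_k\, P(m_k) \log A_{K-1}. \tag{D.21} $$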
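
The expansion leading to (D.20) can be sketched step by step. The following is a worked outline rather than part of the original text; it assumes that P(m) is symmetric, P(m) = P(-m), as required by the symmetry $c(\sigma) = c(-\sigma)$ together with (D.5).

```latex
% Set j = 0 in (D.19) and take the logarithm of both sides:
\begin{align*}
\lambda + 1 &= \frac{K\alpha}{2f} \int_{-1}^{1} \prod_{k=1}^{K-1} dm_k\, P(m_k)
               \bigl( (A_{K-1})^{n} + 1 \bigr)
             - \log \int_{-1}^{1} dm\, P(m) \Bigl( \frac{1+m}{2} \Bigr)^{n}.
\end{align*}
% Expand each piece to O(n) with x^n = 1 + n \log x + O(n^2), using (D.16) and (D.21):
\begin{align*}
\frac{1}{f} &= 1 - na + O(n^2), \\
\int_{-1}^{1} \prod_{k=1}^{K-1} dm_k\, P(m_k) \bigl( (A_{K-1})^{n} + 1 \bigr)
            &= 2 + nb + O(n^2), \\
\log \int_{-1}^{1} dm\, P(m) \Bigl( \frac{1+m}{2} \Bigr)^{n}
            &= n \int_{-1}^{1} dm\, P(m) \log(1+m) - n \log 2 + O(n^2) \\
            &= \frac{n}{2} \int_{-1}^{1} dm\, P(m) \log(1-m^2) - n \log 2 + O(n^2).
\end{align*}
% The last equality uses the assumed symmetry P(m) = P(-m).
% Multiplying out the first two lines,
\begin{align*}
\frac{K\alpha}{2} (1 - na)(2 + nb) = K\alpha + n K\alpha \Bigl( -a + \frac{b}{2} \Bigr) + O(n^2),
\end{align*}
% and subtracting the logarithm reproduces the right hand side of (D.20).
```
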
The final term on the right hand side of (D.18) is evaluated as 
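$$ \begin{aligned} &\sum_{j=0}^{n} \binom{n}{j} \int_{-1}^{1} dm_K\, P(m_K) \left( \frac{1+m_K}{2} \right)^{n-j} \left( \frac{1-m_K}{2} \right)^{j} \int_{-1}^{1} \prod_{k=1}^{K-1} dm_k\, P(m_k) \bigl( (A_{K-1})^{n-j} + (A_{K-1})^{j} \bigr) \\ &\quad = 2 \int_{-1}^{1} \prod_{k=1}^{K} dm_k\, P(m_k) \left( \frac{1+m_K}{2} A_{K-1} + \frac{1-m_K}{2} \right)^{n} = 2f. \end{aligned} \tag{D.22} $$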
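
The step from the sum over j to the closed form in (D.22) rests on the binomial theorem. A short sketch, with the shorthand x and y introduced here only for illustration and with the symmetry P(m) = P(-m) assumed:

```latex
% Shorthand (assumed here): x = (1+m_K)/2, \quad y = (1-m_K)/2, \quad x + y = 1.
\begin{align*}
\sum_{j=0}^{n} \binom{n}{j} x^{n-j} y^{j} (A_{K-1})^{n-j} &= (x A_{K-1} + y)^{n}, \\
\sum_{j=0}^{n} \binom{n}{j} x^{n-j} y^{j} (A_{K-1})^{j}   &= (x + y A_{K-1})^{n}.
\end{align*}
% From the definition (D.8),
\begin{align*}
x A_{K-1} + y
  = 1 + (e^{-\beta}-1) \frac{1+m_K}{2} \prod_{k=1}^{K-1} \frac{1+m_k}{2}
  = A_K,
\end{align*}
% so the first sum integrates to f by (D.6). The second sum is mapped onto the
% first by m_K \to -m_K, which leaves the integral unchanged when P(m_K) = P(-m_K).
% The two contributions are therefore equal and add up to 2f.
```
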


Combining (D.15), (D.16), (D.18), (D.20), and (D.22), we find 


$$ -\frac{\beta F}{Nn} = \log 2 + \alpha(1-K)a + \frac{\alpha K}{2} b - \frac{1}{2} \int_{-1}^{1} dm\, P(m) \log(1-m^2) + O(n), \tag{D.23} $$


which gives the final answer (9.55) for the equilibrium free energy.